Hello,
does GPC have a data type that corresponds to the wchar_t data type of gcc? If not, is there an easy way to port it and make sure it always has the same size as in gcc?
Yours,
Markus
Markus Gerwinski wrote:
Hello,
does GPC have a data type that corresponds to the wchar_t data type of gcc? If not, is there an easy way to port it and make sure it always has the same size as in gcc?
AFAICS we have no corresponding type. Adding such a type to GPC is trivial; the hardest thing is choosing a name -- 'WideChar' would be the natural choice, but in Delphi 'WideChar' is different (Windows insists on 16-bit chars). Hmm, `WcharInt'?
Note that making such a type would be purely for interfacing. Adding proper Pascal support for wide characters is a much bigger job -- think about wide strings (literals!).
By the way, I believe that GNU autoconf has tests for the size of wchar_t, so if you use autotools you should be able to define a Pascal equivalent at configure time.
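For illustration, the generated declaration might look something like this (a sketch only: the name 'WCharInt' is made up, the 32-bit size is just an assumption that configure would replace with the real sizeof (wchar_t), and it relies on GPC's size attribute for integer types):

unit WCharType;

interface

type
  { The size would be substituted by configure from the autoconf test. }
  WCharInt = Integer attribute (Size = 32);
  PWCharInt = ^WCharInt;

implementation

end.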
Waldek Hebisch wrote:
Markus Gerwinski wrote:
Hello,
does GPC have a data type that corresponds to the wchar_t data type of gcc? If not, is there an easy way to port it and make sure it always has the same size as in gcc?
AFAICS we have no corresponding type. Adding such a type to GPC is trivial; the hardest thing is choosing a name -- 'WideChar' would be the natural choice, but in Delphi 'WideChar' is different (Windows insists on 16-bit chars). Hmm, `WcharInt'?
Note that making such a type would be purely for interfacing. Adding proper Pascal support for wide characters is a much bigger job -- think about wide strings (literals!).
By the way, I believe that GNU autoconf has tests for the size of wchar_t, so if you use autotools you should be able to define a Pascal equivalent at configure time.
Wide strings, wide sets, the ability to read and write wide characters, and probably also UTF-8 to and from files.
To me it well illustrates the fallacy of using C as an example for new extensions. Pascal is not C, and has fundamentally different aims.
Scott Moore wrote:
Wide strings, wide sets, the ability to read and write wide characters, and probably also UTF-8 to and from files.
To me it well illustrates the fallacy of using C as an example for new extensions. Pascal is not C, and has fundamentally different aims.
Could you be more explicit here? Do you think that we should have a single char type? If yes, then would you limit that type to 8 bits (or maybe 16)?
Waldek Hebisch wrote:
Scott Moore wrote:
Wide strings, wide sets, the ability to read and write wide characters, and probably also UTF-8 to and from files.
To me it well illustrates the fallacy of using C as an example for new extensions. Pascal is not C, and has fundamentally different aims.
Could you be more explicit here? Do you think that we should have a single char type? If yes, then would you limit that type to 8 bits (or maybe 16)?
There are going to be all kinds of opinions here. I believe Unicode support makes the most sense for Pascal by embracing it. In Pascal "char" is a distinct type. No assumptions are made about its size or characteristics other than ordering. Char is not used to mean "byte like entity" as in C.
So in a "wide mode program", char is 16 bits, set of char is still possible, meaning a minimum set is 65535 elements (and implementing sparse sets becomes more important), and string constants and data are 16 bit. This makes sense because Unicode is a superset of ASCII, i.e., a program that follows the rules of ISO 7185 Pascal should be recompilable for Unicode mode without any changes whatever.
Issue #2, and separate from my point of view, is what form the support libraries take on such a program. Obviously the 16 bit characters must be accommodated, but in addition, (at minimum) the following options should exist for input and output of characters (as text files or even "file of char"):
1. I/O as ASCII, i.e., reduce and expand characters from 8 bit in files.
2. I/O as Unicode 16 bit characters, in "big endian" format.
3. As 2, but in "little endian" format (note that Unicode has an "endian" marker to tell readers what endian format is in a file).
Now, putting my salesman hat on, I believe (1) should be replaced with UTF-8. UTF-8 is completely upward and downward compatible with Unicode and ASCII, so it can replace ASCII format in a Unicode enabled program. If the program deals in ASCII only (even though it manipulates it as 16 bit Unicode), it will be I/Oed as ASCII. If any Unicode is used (characters > 127), then Unicode is generated (in UTF-8 escape format).
In addition, UTF-8 is gaining in popularity. UTF-8 manipulation inside a program is ungainly, since you have to compensate for multibyte characters, but UTF-8 as an I/O format with full 16 bit Unicode processing internally is, IMHO, an ideal solution.
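To make the split concrete, here is a minimal sketch of the idea: characters handled as 16 bit values internally, UTF-8 only at the file boundary. The names are made up for the example, and it assumes a 0..255 subrange is stored as one octet per file element (as GPC does); it is not meant as IP Pascal or GPC library code.

program Utf8OutSketch;

type
  Byte8 = 0 .. 255;
  Word16 = 0 .. 65535;
  ByteFile = file of Byte8;

{ Write one 16 bit code point in UTF-8 (1 to 3 bytes). }
procedure WriteUtf8 (var f: ByteFile; Code: Word16);
begin
  if Code < 16#80 then
    Write (f, Code)                                  { plain ASCII: 1 byte }
  else if Code < 16#800 then
    begin                                            { 2-byte sequence }
      Write (f, 16#C0 or (Code shr 6));
      Write (f, 16#80 or (Code and 16#3F))
    end
  else
    begin                                            { 3-byte sequence }
      Write (f, 16#E0 or (Code shr 12));
      Write (f, 16#80 or ((Code shr 6) and 16#3F));
      Write (f, 16#80 or (Code and 16#3F))
    end
end;

var
  f: ByteFile;
begin
  Assign (f, 'out.utf8');
  Rewrite (f);
  WriteUtf8 (f, Ord ('A'));   { ASCII stays a single byte }
  WriteUtf8 (f, 16#20AC);     { the Euro sign becomes 3 bytes }
  Close (f)
end.

The point is that the program only ever sees 16 bit values; the multibyte handling is confined to the I/O routine.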
Scott Moore wrote:
Waldek Hebisch wrote:
Scott Moore wrote:
Wide strings, wide sets, the ability to read and write wide characters, and probably also UTF-8 to and from files.
To me it well illustrates the fallacy of using C as an example for new extensions. Pascal is not C, and has fundamentally different aims.
Could you be more explicit here? Do you think that we should have a single char type? If yes, then would you limit that type to 8 bits (or maybe 16)?
There are going to be all kinds of opinions here. I believe Unicode support makes the most sense for Pascal by embracing it. In Pascal "char" is a distinct type. No assumptions are made about its size or characteristics other than ordering. Char is not used to mean "byte like entity" as in C.
So in a "wide mode program", char is 16 bits, set of char is still possible, meaning a minimum set is 65535 elements (and implementing sparse sets becomes more important), and string constants and data are 16 bit. This makes sense because Unicode is a superset of ASCII, i.e., a program that follows the rules of ISO 7185 Pascal should be recompilable for Unicode mode without any changes whatever.
I do not like the concept of "compilation modes". Unicode is not merely a huge collection of code-points -- a truly Unicode-aware program is likely to use different algorithms and data structures. Sure, many programs will work without changes, but the same is true using 8-bit bytes and UTF-8.
Note also that full Unicode needs 21 bits (and because of combining chars you still cannot identify characters with Unicode code-points). Glibc normally uses a 32-bit 'wchar_t', which is another argument for 32-bit chars.
It is "good Wirth/Pascal" tradition to offer only one size. But IHMO it is _not_ GNU Pascal tradition. One of GPC selling points is interfacing to GNU C, so we should provide a type matching with 'wchar_t'. We try to be compatible with all significant Pascal dialects (including Delphi), and Delphi for long time had two (maybe more) charactes types. Also, we have a bunch of integer types. So I see no reason to insist on single char type. There is a reason to insist on single type for string literls. However if normal string can represent enough codes and if we provide builtin conversion function (builtin, because we want to constant-fold them), than can initialse other types via explicit conversion.
Issue #2, and separate from my point of view, is what form the support libraries take on such a program. Obviously the 16 bit characters must be accommodated, but in addition, (at minimum) the following options should exist for input and output of characters (as text files or even "file of char"):
- I/O as ASCII, i.e., reduce and expand characters from 8 bit in files.
- I/O as Unicode 16 bit characters, in "big endian" format.
- As 2, but in "little endian" format (note that Unicode has an "endian"
marker to tell readers what endian format is in a file).
Now, putting my salesman hat on, I believe (1) should be replaced with UTF-8.
I am already sold on UTF-8 :). But I have found that in practice Unicode processing can be 100 times slower than ASCII processing (on the same pure ASCII data). So to make Unicode practical one has to employ various tricks. And such tricks involve the choice of data representation. For some problems UTF-32 (or UTF-16) is the fastest. But for others UTF-8 is both fastest and simplest -- AFAICS, for example, in many cases regex matching (or parsing) of UTF-8 data can be as fast as matching pure 8-bit data (and much faster than UTF-16/32). As I wrote, I think it is best to give the programmer the choice.
Anyway, regardless of what solution we choose to support Unicode, we need a type for interfacing to C. If we think that the type is only for interfacing, then a name like 'WcharInt' (and making it an integer type) makes sense.
I personally think that we may use such a type as the Pascal wide character type, but if that is too controversial, then I propose to just add an interfacing type now and postpone the question of proper Unicode support.
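To make the interfacing use concrete, a sketch (the 'WCharInt'/'PWCharInt' names are assumptions, and 'SizeType' is assumed to match C's size_t):

type
  PWCharInt = ^WCharInt;  { WCharInt being the type matching wchar_t }

function wcslen (s: PWCharInt): SizeType; external name 'wcslen';

With just that much, a wchar_t string coming from a C library could at least be measured and passed around, even before we have any real wide char or wide string support.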
Waldek Hebisch wrote:
Scott Moore wrote:
Waldek Hebisch wrote:
Scott Moore wrote:
Wide strings, wide sets, the ability to read and write wide characters, and probably also UTF-8 to and from files.
To me it well illustrates the fallacy of using C as an example for new extensions. Pascal is not C, and has fundamentally different aims.
Could you be more explicit here? Do you think that we should have a single char type? If yes, then would you limit that type to 8 bits (or maybe 16)?
There are going to be all kinds of opinions here. I believe Unicode support makes the most sense for Pascal by embracing it. In Pascal "char" is a distinct type. No assumptions are made about its size or characteristics other than ordering. Char is not used to mean "byte like entity" as in C.
So in a "wide mode program", char is 16 bits, set of char is still possible, meaning a minimum set is 65535 elements (and implementing sparse sets becomes more important), and string constants and data are 16 bit. This makes sense because Unicode is a superset of ASCII, i.e., a program that follows the rules of ISO 7185 Pascal should be recompilable for Unicode mode without any changes whatever.
I do not like the concept of "compilation modes". Unicode is not merely a huge collection of code-points -- a truly Unicode-aware program is likely to use different algorithms and data structures. Sure, many programs will work without changes, but the same is true using 8-bit bytes and UTF-8.
I agree. All units/libraries would have to exist in both compilation modes (and would need to be tested twice), etc. And, of course, once other issues appear that "want" to have compilation modes, we get to have 2^n copies of libraries which is even less practical.
Note also that full Unicode needs 21 bits (and because of combining chars you still cannot identify characters with Unicode code-points). Glibc normally uses a 32-bit 'wchar_t', which is another argument for 32-bit chars.
I also agree. If anything, then either 32 bits immediately, or just match C's wchar_t, which the backend seems to provide us -- I still have to check in detail.
It is "good Wirth/Pascal" tradition to offer only one size. But IHMO it is _not_ GNU Pascal tradition.
Of course, even Wirth's Pascal has integer subranges, so if we could define ASCII char as a subrange of Unicode char, we might be fine. But this would mean (a) `Char' would be Unicode (the standard type must be the biggest if we want to follow the standard's "spirit" here), which is too big a change, I fear, and (b) it would not cater for a UTF8 type (in particular such a string type), which is neither an array of ASCII chars nor of Unicode chars.
So I think we should (must) leave `Char' as it is. Besides the usual suspects -- binary files and other protocols which depend on data type layout -- changing `Char' would mean breaking most programs that handle text(!) files with 7 bit (i.e., ASCII) and 8 bit charsets. We can't realistically do that.
BTW, Scott, you argued partly as if Pascal was based on ASCII. As you certainly know, this is not required, and in fact, many non-English speaking countries use an extended charset (e.g. ISO-8859-n). I do this myself with GPC today (of course, not for program identifiers, but for `Char' data, just to avoid confusion). While Latin1 (ISO-8859-1), and only this one, is upward compatible to Unicode, none of them are compatible to UTF-8 (except for the ASCII subset), so your compatibility arguments already fall down here. And in your I/O list, you'd definitely have to add I/O in an 8 bit charset -- where how to select the charset to convert Unicode to and from is another question, but there must be an (easy) way, because that's what a large part of the world uses today.
One of GPC's selling points is interfacing to GNU C, so we should provide a type matching 'wchar_t'.
Agreed.
We try to be compatible with all significant Pascal dialects (including Delphi), and Delphi has for a long time had two (maybe more) character types.
Of course, as often, badly named -- if you mean `AnsiChar'. Which ANSI standard does it refer to? (I actually don't know American standards too well. I suppose Unicode has an ANSI number, but so does ASCII, I suppose, and perhaps Latin1 etc.) If read in a Pascal context, it would sound like the `Char' type of ANSI Pascal (which is equivalent to ISO Pascal). To old-time Unix users, "ansi" is better known as a terminal type, and to old-time Dos users it also refers to "ansi"-like terminal sequences, sometimes, for some contorted reasons, plus IBM PC/MS DOS specific 8 bit character sets (which are quite unlike ISO-8859-n or Unicode).
So at best, the name seems ambiguous and confusing, and as often, I'd rather build in another type name, and make (leave) `AnsiChar' an "only Borland compatibility" thing.
Also, we have a bunch of integer types. So I see no reason to insist on a single char type. There is a reason to insist on a single type for string literals. However, if the normal string type can represent enough codes and if we provide builtin conversion functions (builtin, because we want to constant-fold them), then we can initialise other types via explicit conversion.
In principle I agree, though I shudder at the thought of implementing the string types this will need and their conversions. But it might be inevitable in the long run ...
Anyway, regardless of what solution we choose to support Unicode, we need a type for interfacing to C. If we think that the type is only for interfacing, then a name like 'WcharInt' (and making it an integer type) makes sense.
If it is *only* for interfacing, I'd even suggest something like `CWCharT' or `CWideChar', to make that clear, just like we have `CString', `CInteger' (in the next release) etc.
But if we foresee to use it in Pascal later, I'd rather use a more readable name instead. Perhaps just `WideChar' (but what does "wide" mean, isn't an "m" wider than an "i"? ;-).
And perhaps we could even make it an "abstract ordinal type" for now, i.e., like an enum type (with a rather large range though) whose identifiers were forgotten (as in, not exported from a module). So one could use ordinal functions like `Ord', `Succ' etc., but not much more. This should avoid breaking too many things if we'll actually turn it into a char type sometime.
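Roughly like this ('CWideChar' is only a placeholder name for that abstract ordinal type):

procedure Demo (c: CWideChar);
begin
  WriteLn (Ord (c));            { taking the ordinal value would work }
  if c < High (CWideChar) then
    c := Succ (c);              { so would Succ/Pred and comparisons }
  { but WriteLn (c) or c := 'x' would be rejected for now }
end;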
I personally think that we may use such a type as the Pascal wide character type, but if that is too controversial, then I propose to just add an interfacing type now and postpone the question of proper Unicode support.
I'd also postpone it (since we both probably have enough other things to do first), but implementing such a type (i.e., layout compatible with C's `wchar_t', an abstract ordinal type, a meaningful name) might be reasonable now ...
Frank
Frank Heckenbach wrote:
BTW, Scott, you argued partly as if Pascal was based on ASCII. As you certainly know, this is not required, and in fact, many non-English speaking countries use an extended charset (e.g. ISO-8859-n). I do this myself with GPC today (of course, not for program identifiers, but for `Char' data, just to avoid confusion). While Latin1 (ISO-8859-1), and only this one, is upward compatible to Unicode, none of them are compatible to UTF-8 (except for the ASCII subset), so your compatibility arguments already fall down here. And in your I/O list, you'd definitely have to add I/O in an 8 bit charset -- where how to select the charset to convert Unicode to and from is another question, but there must be an (easy) way, because that's what a large part of the world uses today.
Didn't mean to sound Amerocentric :-)
My understanding of the ISO code pages is that the characters outside of ASCII are in the > 127 codes. So, for example, IP Pascal specifically leaves the 8th bit unmolested, so it would I/O other ISO code pages OK, and would accept ISO pages as source, since it treats c < 32 or c > 127 as characters to be ignored.
UTF-8/Unicode is certainly not compatible with the ISO code page idea, but rather replaces it. So certainly, UTF-8 is designed to be compatible with the ASCII code set and no other.
My take on it is that I support ISO code pages in the 8 bit mode, and Unicode replaces ISO code pages in the 16 bit mode. Does the upward compatibility suck for Europe? Certainly. I see the resolution of that being Unicode internal processing, i.e., "world centered" code. The beauty of UTF-8 (and other forms) is that nobody has to know or care that my programs are Unicode internally.
Scott Moore wrote:
Frank Heckenbach wrote:
BTW, Scott, you argued partly as if Pascal was based on ASCII. As you certainly know, this is not required, and in fact, many non-English speaking countries use an extended charset (e.g. ISO-8859-n). I do this myself with GPC today (of course, not for program identifiers, but for `Char' data, just to avoid confusion). While Latin1 (ISO-8859-1), and only this one, is upward compatible to Unicode, none of them are compatible to UTF-8 (except for the ASCII subset), so your compatibility arguments already fall down here. And in your I/O list, you'd definitely have to add I/O in an 8 bit charset -- where how to select the charset to convert Unicode to and from is another question, but there must be an (easy) way, because that's what a large part of the world uses today.
Didn't mean to sound Amerocentric :-)
My understanding of the ISO code pages is that the characters outside of ASCII are in the > 127 codes. So, for example, IP Pascal specifically leaves the 8th bit unmolested, so it would I/O other ISO code pages OK, and would accept ISO pages as source, since it treats c < 32 or c > 127 as characters to be ignored.
That's a valid decision, according to ISO Pascal, but not the one we've made, or which I personally like. Ignoring control characters is not exactly my idea, and characters > 127 (usually interpreted in ISO-8859-n) have been in use for a long time ...
UTF-8/Unicode is certainly not compatible with the ISO code page idea, but rather replaces it. So certainly, UTF-8 is designed to be compatible with the ASCII code set and no other.
Exactly.
My take on it is that I support ISO code pages in the 8 bit mode, and Unicode replaces ISO code pages in the 16 bit mode. Does the upward compatibility suck for Europe?
Not too much in Western Europe, since Latin1 is (intentionally, of course) a proper subset of Unicode. But still, of course, files in 8 bit Latin1 and UTF-8 are not compatible. There will be both kinds of files to deal with, apart from 16 bit (perhaps 21 bit, stored as 32 bit) Unicode files, so a full solution will probably have to support them all.
Certainly. I see the resolution of that being Unicode internal processing, i.e., "world centered" code. The beauty of UTF-8 (and other forms) is that nobody has to know or care that my programs are Unicode internally.
Mostly yes. But when, e.g., storing data (even consisting of only Latin1, but not only ASCII characters) in a file, there is a difference between UTF-8 and 8 bit coding.
Frank
Frank Heckenbach wrote:
My understanding of the ISO code pages is that the characters outside of ASCII are in the > 127 codes. So, for example, IP Pascal specifically leaves the 8th bit unmolested, so it would I/O other ISO code pages OK, and would accept ISO pages as source, since it treats c < 32 or c > 127 as characters to be ignored.
That's a valid decision, according to ISO Pascal, but not the one we've made, or which I personally like. Ignoring control characters is not exactly my idea, and characters > 127 (usually interpreted in ISO-8859-n) have been in use for a long time ...
Probably that needs clarification. When reading the source, IP "ignores" c < 32 and c > 127, as in it does not try to parse or otherwise deal with them, but allows them to appear in the source without an error. That means that c < 32 and c > 127 characters can appear in the listing without fault. This means that ISO code page characters can be used in comments. Since Pascal does not require any characters for the language outside 32 <= c <= 127, this gives ISO code page capability without relying on any particular code page format.
Scott Moore wrote:
Frank Heckenbach wrote:
My understanding of the ISO code pages is that the characters outside of ASCII are in the > 127 codes. So, for example, IP Pascal specifically leaves the 8th bit unmolested, so it would I/O other ISO code pages OK, and would accept ISO pages as source, since it treats c < 32 or c > 127 as characters to be ignored.
That's a valid decision, according to ISO Pascal, but not the one we've made, or which I personally like. Ignoring control characters is not exactly my idea, and characters > 127 (usually interpreted in ISO-8859-n) have been in use for a long time ...
Probably that needs clarification. When reading the source, IP "ignores" c < 32 and c > 127, as in it does not try to parse or otherwise deal with them, but allows them to appear in the source without an error. That means that c < 32 and c > 127 characters can appear in the listing without fault.
This means that ISO code page characters can be used in comments. Since Pascal does not require any characters for the language outside 32 <= c <= 127, this gives ISO code page capability without relying on any particular code page format.
Ah, you're talking about source code. The problem is not so serious here (except for string literals). Data compatibility is the main concern, AFAICS.
Frank
Frank Heckenbach wrote:
Scott Moore wrote:
Frank Heckenbach wrote:
My understanding of the ISO code pages is that the characters outside of ASCII are in the > 127 codes. So, for example, IP Pascal specifically leaves the 8th bit unmolested, so it would I/O other ISO code pages OK, and would accept ISO pages as source, since it treats c < 32 or c > 127 as characters to be ignored.
That's a valid decision, according to ISO Pascal, but not the one we've made, or which I personally like. Ignoring control characters is not exactly my idea, and characters > 127 (usually interpreted in ISO-8859-n) have been in use for a long time ...
Probably that needs clarification. When reading the source, IP "ignores" c < 32 and c > 127, as in it does not try to parse or otherwise deal with them, but allows them to appear in the source without an error. That means that c < 32 and c > 127 characters can appear in the listing without fault.
This means that ISO code page characters can be used in comments. Since Pascal does not require any characters for the language outside 32 <= c <= 127, this gives ISO code page capability without relying on any particular code page format.
Ah, you're talking about source code. The problem is not so serious here (except for string literals). Data compatibility is the main concern, AFAICS.
Frank
For data the situation is not much different. The program does what it likes with the c > 127 codes, which means that it is code page independent. Yes, it does argue for two modes: an 8 bit character mode which can manipulate code pages, and a Unicode/UTF-8 mode which handles the rest. And two different libraries.
On 4 Oct 2004 at 5:55, Frank Heckenbach wrote:
[...]
Anyway, regardless of what solution we choose to support Unicode, we need a type for interfacing to C. If we think that the type is only for interfacing, then a name like 'WcharInt' (and making it an integer type) makes sense.
If it is *only* for interfacing, I'd even suggest something like `CWCharT' or `CWideChar', to make that clear, just like we have `CString', `CInteger' (in the next release) etc.
But if we foresee to use it in Pascal later, I'd rather use a more readable name instead. Perhaps just `WideChar' (but what does "wide" mean, isn't an "m" wider than an "i"? ;-).
I believe that, in Delphi, "WideChar" is the term used for Unicode characters (with "WideString" as the corresponding string type).
Best regards, The Chief
---------
Prof. Abimbola Olowofoyeku (The African Chief)
Web: http://www.bigfoot.com/~african_chief/
Frank Heckenbach wrote:
Waldek Hebisch wrote:
... snip ...
I do not like the concept of "compilation modes". Unicode is not merely a huge collection of code-points -- a truly Unicode-aware program is likely to use different algorithms and data structures. Sure, many programs will work without changes, but the same is true using 8-bit bytes and UTF-8.
I agree. All units/libraries would have to exist in both compilation modes (and would need to be tested twice), etc. And, of course, once other issues appear that "want" to have compilation modes, we get to have 2^n copies of libraries which is even less practical.
Not if the internal char type is 32 bit. Once that decision is made, the only problem is how to supply and describe the file interfaces, and (of secondary importance) the interconnection with C coding. I think we can already eliminate the need for a compact internal char representation for embedded systems, because the fixed overhead is already monstrous.
Note also that full Unicode needs 21 bits (and because of combining chars you still cannot identify characters with Unicode code-points). Glibc normally uses a 32-bit 'wchar_t', which is another argument for 32-bit chars.
I also agree. If anything, then either 32 bits immediately, or just match C's wchar_t, which the backend seems to provide us -- I still have to check in detail.
It is "good Wirth/Pascal" tradition to offer only one size. But IHMO it is _not_ GNU Pascal tradition.
Of course, even Wirth's Pascal has integer subranges, so if we could define ASCII char as a subrange of Unicode char, we might be fine. But this would mean (a) `Char' would be Unicode (the standard type must be the biggest if we want to follow the standard's "spirit" here), which is too big a change, I fear, and (b) it would not cater for a UTF8 type (in particular such a string type), which is neither an array of ASCII chars nor of Unicode chars.
So I think we should (must) leave `Char' as it is. Besides the usual suspects -- binary files and other protocols which depend on data type layout -- changing `Char' would mean breaking most programs that handle text(!) files with 7 bit (i.e., ASCII) and 8 bit charsets. We can't realistically do that.
Binary file interfaces can be simply handled with a subrange of integer, such as "TYPE byte = 0..255;". This means the system has to adjust storage usage to the cardinality of the subranges. UTF8 need never be an internal format, as long as routines are provided to convert between UTF8 strings and internal Unicode strings.
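For example (a minimal sketch; 'Byte8' is just a name chosen so as not to clash with GPC's built-in Byte):

program ByteFileSketch;

type
  Byte8 = 0 .. 255;
  ByteFile = file of Byte8;

var
  f: ByteFile;
  b: Byte8;

begin
  Assign (f, 'data.bin');
  Reset (f);
  while not EOF (f) do
    begin
      Read (f, b);   { b is always a raw octet, whatever size Char has }
      { ... process b ... }
    end;
  Close (f)
end.

The file layout then depends only on the declared subrange, not on the compiler's Char representation.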
The amount of storage dedicated to a char need not affect the programs. My PascalP of 20 years ago used 16 bit storage on the HP3000, and 8 bit on byte addressing machines. Of course the HP3000 didn't actually use any chars requiring over 8 bits. Both machines generated identical output from identical input.
Using wide internal char storage caters to future machines that are not byte addressing, or that, in C terms, have a much larger byte.
... snip ...
I personally think that we may use such a type as the Pascal wide character type, but if that is too controversial, then I propose to just add an interfacing type now and postpone the question of proper Unicode support.
I'd also postpone it (since we both probably have enough other things to do first), but implementing such a type (i.e., layout compatible with C's `wchar_t', an abstract ordinal type, a meaningful name) might be reasonable now ...
I agree that postponement is in order. Full range-checking should have priority. However things should be done with a view to future paths. This may well include two types of text files, say text and atext (and maybe utext). The narrower forms can be supplied by subrange definitions at the outer 0 scope level, making them easily customizable.
CBFalconer wrote:
Frank Heckenbach wrote:
Waldek Hebisch wrote:
... snip ...
I do not like the concept of "compilation modes". Unicode is not merely a huge collection of code-points -- a truly Unicode-aware program is likely to use different algorithms and data structures. Sure, many programs will work without changes, but the same is true using 8-bit bytes and UTF-8.
I agree. All units/libraries would have to exist in both compilation modes (and would need to be tested twice), etc. And, of course, once other issues appear that "want" to have compilation modes, we get to have 2^n copies of libraries which is even less practical.
Not if the internal char type is 32 bit.
But here I already disagree. Even if we can ignore the (4 times) bigger memory requirement, there is additional conversion to do, and I don't think any plain old text-processing program should endure that ...
Once that decision is made, the only problem is how to supply and describe the file interfaces, and (of secondary importance) the interconnection with C coding. I think we can already eliminate the need for a compact internal char representation for embedded systems, because the fixed overhead is already monstrous.
Not necessarily. As you say, that overhead is fixed, while 32 bit chars make a difference proportional to the size of the data. I.e., you can store only 1/4 as much text data in a given amount of memory. I don't think we can always ignore that -- not on embedded systems, and not even on desktops, which have more memory, but also may work on larger amounts of text.
So I think we should (must) leave `Char' as it is. Besides the usual suspects -- binary files and other protocols which depend on data type layout -- changing `Char' would mean breaking most programs that handle text(!) files with 7 bit (i.e., ASCII) and 8 bit charsets. We can't realistically do that.
Binary file interfaces can be simply handled with a subrange of integer, such as "TYPE byte = 0..255;". This means the system has to adjust storage usage to the cardinality of the subranges.
People would have to change all data file types, i.e. a massive breach of backward-compatibility. For binary files, this may be arguable, since we've had the same problem with other layout changes. For text files, I think it's just too much of a problem.
UTF8 need never be an internal format, as long as routines are provided to convert between UTF8 strings and internal Unicode strings.
See Waldek's comments. Using UTF8 internally can indeed be useful.
The amount of storage dedicated to a char need not affect the programs.
You mean the program semantics, which may be true (apart from binary files, and other layout-sensitive areas, which are always problematic, anyway, of course). But it certainly does affect the memory consumption, see above.
I agree that postponement is in order. Full range-checking should have priority. However things should be done with a view to future paths. This may well include two types of text files, say text and atext (and maybe utext). The narrower forms can be supplied by subrange definitions at the outer 0 scope level, making them easily customizable.
I don't think so. Even if you can define the subrange type, and thus a file of this subrange type on the user level, the `Text' file has special semantics (in contrast to `file of Char'), and it's not possible to define one's own file type with such semantics. Therefore all text file types required have to be built-in. And as discussed with Scott, we'll probably need at least text files with 8-bit coding, UTF-8-coding, Unicode (little and big endian). Whether these will be distinct types or the mode can be set (per file, perhaps at binding time), and whether/how to convert automatically if, e.g. reading a Unicode char from an 8-bit file etc., these are further questions which I'd also rather postpone now ...
Frank
Hi Folks,
thank you all for your helpful answers! :-)
Waldek Hebisch wrote:
Markus Gerwinski wrote:
does GPC have a data type that corresponds to the wchar_t data type of gcc? If not, is there an easy way to port it and make sure it always has the same size as in gcc?
AFAICS we have no corresponding type. Adding such a type to GPC is trivial; the hardest thing is choosing a name -- 'WideChar' would be the natural choice, but in Delphi 'WideChar' is different (Windows insists on 16-bit chars). Hmm, `WcharInt'?
I think in that case I'd rather work around the wchar, just treating any `WideString' as a pointer. Or by default translate it into a Unicode representation.
By the way, I believe that GNU autoconf has tests for the size of wchar_t, so if you use autotools you should be able to define a Pascal equivalent at configure time.
Wouldn't using this test mean passing its result to my units via a build argument? That's a thing I want to avoid. I want the units to be compilable with as few special arguments in the build command as possible.
Thanks again,
Markus
Markus Gerwinski wrote:
Waldek Hebisch wrote:
Markus Gerwinski wrote:
does GPC have a data type that corresponds to the wchar_t data type of gcc? If not, is there an easy way to port it and make sure it always has the same size as in gcc?
AFAICS we have no corresponding type. Adding such a type to GPC is trivial; the hardest thing is choosing a name -- 'WideChar' would be the natural choice, but in Delphi 'WideChar' is different (Windows insists on 16-bit chars). Hmm, `WcharInt'?
I think in that case I'd rather work around the wchar, just treating any `WideString' as a pointer.
That's certainly possible, if you can afford to treat it in an opaque way.
By the way, I believe that GNU autoconf has tests for the size of wchar_t, so if you use autotools you should be able to define a Pascal equivalent at configure time.
Wouldn't using this test mean passing its result to my units via a build argument? That's a thing I want to avoid. I want the units to be compilable with as few special arguments in the build command as possible.
FWIW, I agree, that's why I try to let the RTS do as much autoconf'ing (in other areas) as possible. And for `wchar_t', as I wrote before, we probably don't even need autoconf since the backend gives us the type already.
Frank
Frank Heckenbach wrote:
Markus Gerwinski wrote:
I think in that case I'd rather work around the wchar, just treating any `WideString' as a pointer.
That's certainly possible, if you can afford to treat it in an opaque way.
One more possible solution: In my C layer, I'll transform any wchar_t into its unicode char representation before passing it to the Pascal interface. So there shouldn't be any data loss at all.
Yours,
Markus
Markus Gerwinski wrote:
Frank Heckenbach wrote:
Markus Gerwinski wrote:
I think in that case I'd rather work around the wchar, just treating any `WideString' as a pointer.
That's certainly possible, if you can afford to treat it in an opaque way.
One more possible solution: In my C layer, I'll transform any wchar_t into its unicode char representation before passing it to the Pascal interface. So there shouldn't be any data loss at all.
What do you mean by "unicode char representation"? I thought wchar_t was just that. Or do you mean the 7/8 bit representation of chars that exist in that charset (ASCII/Latin1/...)? But what about other chars then?
Or do you just mean you transform them to an integer of known size (such as `int' = `CInteger')?
Frank
Frank Heckenbach wrote:
Markus Gerwinski wrote:
One more possible solution: In my C layer, I'll transform any wchar_t into its unicode char representation before passing it to the Pascal interface. So there shouldn't be any data loss at all.
What do you mean by "unicode char representation"? I thought wchar_t was just that. Or do you mean the 7/8 bit representation of chars that exist in that charset (ASCII/Latin1/...)? But what about other chars then?
No, wchar_t and unicode are different things in this context. The wchar_t always has a fixed size of 2 or 4 bytes. In the unicode representation, every one of them is changed into a variable number of "normal" characters. So e.g. the wchar_t 'A' is really transformed into a 1-byte char 'A', whereas some special characters may be transformed into a 2-byte or 3-byte representation (AFAIK 3 bytes is the maximum here, but I'm not sure). See `man wcrtomb' for details on that.
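On the Pascal side, getting from such a variable-length byte sequence back to fixed-size values is straightforward. A sketch (names made up; it assumes well-formed input and only handles the 1- to 3-byte forms described above):

type
  Byte8 = 0 .. 255;
  Word16 = 0 .. 65535;

{ Decode the code point starting at Buf[i] and advance i past it. }
function NextCodePoint (const Buf: array of Byte8; var i: Integer): Word16;
begin
  if Buf[i] < 16#80 then
    begin
      NextCodePoint := Buf[i];                             { 1-byte form }
      Inc (i)
    end
  else if Buf[i] < 16#E0 then
    begin                                                  { 2-byte form }
      NextCodePoint := (Buf[i] and 16#1F) shl 6 or (Buf[i + 1] and 16#3F);
      Inc (i, 2)
    end
  else
    begin                                                  { 3-byte form }
      NextCodePoint := ((Buf[i] and 16#0F) shl 12)
        or ((Buf[i + 1] and 16#3F) shl 6) or (Buf[i + 2] and 16#3F);
      Inc (i, 3)
    end
end;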
Markus
Markus Gerwinski wrote:
Frank Heckenbach wrote:
Markus Gerwinski wrote:
One more possible solution: In my C layer, I'll transform any wchar_t into its unicode char representation before passing it to the Pascal interface. So there shouldn't be any data loss at all.
What do you mean by "unicode char representation"? I thought wchar_t was just that. Or do you mean the 7/8 bit representation of chars that exist in that charset (ASCII/Latin1/...)? But what about other chars then?
No, wchar_t and unicode are different things in this context. The wchar_t always has a fixed size of 2 or 4 bytes. In the unicode representation, every one of them is changed into a variable number of "normal" characters. So e.g. the wchar_t 'A' is really transformed into a 1-byte char 'A', whereas some special characters may be transformed into a 2-byte or 3-byte representation (AFAIK 3 bytes is the maximum here, but I'm not sure). See `man wcrtomb' for details on that.
That's UTF-8 (Unicode Transformation Format, 8-bit), actually.
OK, if you can afford to process data in UTF-8 on the Pascal side (as Waldek noted, different formats have different advantages), that should be alright.
Frank
Frank Heckenbach wrote:
That's UTF-8 (Unicode Transformation Format, 8-bit), actually.
Aah... thanks. Seems I mixed up two terms here.
OK, if you can afford to process data in UTF-8 on the Pascal side (as Waldek noted, different formats have different advantages), that should be alright.
The most important thing here is that wchar data come out of a database without information loss. And AFAICS that solution should do the trick, shouldn't it?
Yours,
Markus