Re: wchar

4 Oct 2004


Waldek Hebisch wrote:
...
Scott Moore wrote:
...
Waldek Hebisch wrote:
...
Scott Moore wrote:
...
Wide strings, wide sets, ability to read and write wide characters and
probably also utf-8 to and from files.
To me it well illustrates the fallacy of using C as an example for new
extensions. Pascal is not C, and has fundamentally different aims.
Could you be more explicit here? Do you think that we should have
a single char type? If yes, then would you limit that type to 8 bits
(or maybe 16)?
There are going to be all kinds of opinions here. I believe Unicode support
makes the most sense for Pascal by embracing it. In Pascal "char" is a
distinct type. No assumptions are made about its size or characteristics
other than ordering. Char is not used to mean "byte like entity" as in C.
So in a "wide mode program", char is 16 bits, set of char is still possible,
meaning a minimum set is 65536 elements (and implementing sparse sets becomes
more important), and string constants and data are 16 bit. This makes sense
because Unicode is a superset of ASCII, i.e., a program that follows the
rules of ISO 7185 Pascal should be recompilable for Unicode mode without
any changes whatever.
I do not like the concept of "compilation modes". Unicode is not merely
a huge collection of code-points -- a truly Unicode-aware program is
likely to use different algorithms and data structures. Sure, many
programs will work without changes, but so will programs using 8-bit
bytes and UTF-8.
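
To illustrate the point about 8-bit bytes and UTF-8, here is a minimal Python sketch (the sample strings are made up for illustration). It shows that pure ASCII text is byte-identical in UTF-8, while text outside ASCII needs more than one byte per code point, so a program that equates bytes with characters only keeps working for the ASCII subset.

```python
# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
ascii_text = "program Hello;"
ascii_unchanged = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Outside ASCII, a single code point takes several bytes in UTF-8,
# so the byte count and the code-point count diverge.
word = "Z\u00fcrich"               # U+00FC is LATIN SMALL LETTER U WITH DIAERESIS
utf8_bytes = word.encode("utf-8")
code_points = len(word)            # number of code points
byte_count = len(utf8_bytes)       # number of 8-bit units
```
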
I agree. All units/libraries would have to exist in both compilation
modes (and would need to be tested twice), etc. And, of course, once
other issues appear that "want" to have compilation modes, we get to
have 2^n copies of libraries which is even less practical.
...
Note also that full Unicode needs 21 bits, since code points run up to
U+10FFFF (and because of combining chars you still cannot identify
characters with Unicode code-points). Glibc normally uses 32-bit
'wchar_t', which is another argument for 32-bit chars.
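
The remark about combining chars can be made concrete with a short Python sketch using the standard `unicodedata` module. The example character is chosen for illustration only; it shows one user-visible character that exists both as a single code point and as a sequence of two.

```python
import unicodedata

precomposed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT, two code points

# Both render as the same character, but compare unequal code point by code point.
equal_as_code_points = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# Normalising to NFC composes the sequence, after which the two agree.
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

So even a 32-bit char type holds a code point, not necessarily a whole character as the user sees it.
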
I also agree. If anything, then 32 bits immediately (or just
match C's wchar_t, which the backend seems to provide us -- I still
have to check in detail).
...
It is "good Wirth/Pascal" tradition to offer only one size. But IMHO
it is _not_ GNU Pascal tradition.
Of course, even Wirth's Pascal has integer subranges, so if we could
define ASCII char as a subrange of Unicode char, we might be fine.
But this would mean (a) `Char' would be Unicode (the standard type
must be the biggest if we want to follow the standard's "spirit"
here) which is too big a change I fear, (b) it would not cater for a
UTF8 type (in particular such a string type) which is neither an
array of ASCII chars nor of Unicode chars.
So I think we should (must) leave `Char' as it is. Besides the usual
suspects -- binary files and other protocols which depend on data
type layout -- changing `Char' would mean breaking most programs
that handle text(!) files with 7 bit (i.e., ASCII) and 8 bit
charsets. We can't realistically do that.
BTW, Scott, you argued partly as if Pascal was based on ASCII. As
you certainly know, this is not required, and in fact, many
non-English speaking countries use an extended charset (e.g.
ISO-8859-n). I do this myself with GPC today (of course, not for
program identifiers, but for `Char' data, just to avoid confusion).
While Latin1 (ISO-8859-1), and only this one, is upward compatible
to Unicode, none of them are compatible to UTF-8 (except for the
ASCII subset), so your compatibility arguments already fall down
here. And in your I/O list, you'd definitely have to add I/O in an 8
bit charset -- where how to select the charset to convert Unicode to
and from is another question, but there must be an (easy) way,
because that's what a large part of the world uses today.
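
The compatibility claim above can be checked with a small Python sketch (the variable names are only illustrative). It verifies that Latin1 byte values coincide with the first 256 Unicode code points, and that the same character nevertheless has different byte representations in Latin1 and in UTF-8.

```python
# Latin1 (ISO-8859-1) byte values coincide with the first 256 Unicode code points.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
latin1_matches_unicode = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# The same character is encoded differently in Latin1 and in UTF-8.
u_umlaut = "\u00fc"
latin1_bytes = u_umlaut.encode("latin-1")   # a single byte
utf8_bytes = u_umlaut.encode("utf-8")       # two bytes

# A lone Latin1 byte above 0x7F is not valid UTF-8 at all.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```
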
...
One of GPC selling points is interfacing
to GNU C, so we should provide a type matching with 'wchar_t'.
Agreed.
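
As a quick way to see what a type matching 'wchar_t' would have to look like on a given platform, one can ask Python's standard `ctypes` module, which exposes the C compiler's `wchar_t` as `ctypes.c_wchar`. This is only a sketch for inspection, not a statement about what the GPC backend provides; the result is typically 4 bytes with glibc and 2 bytes on Windows.

```python
import ctypes

# Size of the platform's C wchar_t, as seen through ctypes.
wchar_size_bytes = ctypes.sizeof(ctypes.c_wchar)
wchar_bits = wchar_size_bytes * 8

# Unicode code points run up to U+10FFFF, which needs 21 bits.
fits_all_code_points = wchar_bits >= (0x10FFFF).bit_length()
```
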
...
We try
to be compatible with all significant Pascal dialects (including Delphi),
and Delphi has for a long time had two (maybe more) character types.
Of course, as often, badly named -- if you mean `AnsiChar'. Which
ANSI standard does it refer to? (I actually don't know American
standards too well. I suppose Unicode has an ANSI number, but so
does ASCII, I suppose, and perhaps Latin1 etc.) If read in a Pascal
context, it would sound like the `Char' type of ANSI Pascal (which
is equivalent to ISO Pascal). To old-time Unix users, "ansi" is
better known as a terminal type, and to old-time Dos users it also
refers to "ansi"-like terminal sequences, sometimes, for some
contorted reasons, plus IBM PC/MS DOS specific 8 bit character sets
(which are quite unlike ISO-8859-n or Unicode).
So at best, the name seems ambiguous and confusing, and as often,
I'd rather build in another type name, and make (leave) `AnsiChar'
an "only Borland compatibility" thing.
...
Also, we
have a bunch of integer types. So I see no reason to insist on a single
char type. There is a reason to insist on a single type for string literals.
However, if a normal string can represent enough codes and if we provide
builtin conversion functions (builtin, because we want to constant-fold
them), then we can initialise other types via explicit conversion.
In principle I agree, though I shudder at the thought of implementing
the string types this will need and their conversions. But it might
be inevitable in the long run ...
...
Anyway, regardless of what solution we choose to support Unicode, we
need a type for interfacing to C. If we think that the type is only
for interfacing, then a name like 'WcharInt' (and making it an integer
type) makes sense.
If it is *only* for interfacing, I'd even suggest something like
`CWCharT' or `CWideChar', to make that clear, just like we have
`CString', `CInteger' (in the next release) etc.
But if we foresee to use it in Pascal later, I'd rather use a more
readable name instead. Perhaps just `WideChar' (but what does "wide"
mean, isn't an "m" wider than an "i"? ;-).
And perhaps we could even make it an "abstract ordinal type" for
now, i.e., like an enum type (with a rather large range though)
whose identifiers were forgotten (as in, not exported from a
module). So one could use ordinal functions like `Ord', `Succ' etc.,
but not much more. This should avoid breaking too many things if
we'll actually turn it into a char type sometime.
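
The "abstract ordinal type" idea might look roughly like the following Python sketch. Everything here is hypothetical: the class name `WideChar`, the method names and the upper bound are assumptions made for illustration, not an existing GPC interface. The type supports ordering and the ordinal functions, and deliberately nothing else (no arithmetic, no mixing with integers or chars).

```python
from functools import total_ordering


@total_ordering
class WideChar:
    """Hypothetical opaque ordinal: ordering, ord, succ and pred only."""

    __slots__ = ("_value",)
    MAX = 0x10FFFF  # assumed upper bound, chosen only for this sketch

    def __init__(self, value):
        if not 0 <= value <= self.MAX:
            raise ValueError("ordinal value out of range")
        self._value = value

    def ord(self):
        return self._value

    def succ(self):
        return WideChar(self._value + 1)   # raises ValueError past MAX

    def pred(self):
        return WideChar(self._value - 1)   # raises ValueError below 0

    def __eq__(self, other):
        if not isinstance(other, WideChar):
            return NotImplemented
        return self._value == other._value

    def __lt__(self, other):
        if not isinstance(other, WideChar):
            return NotImplemented
        return self._value < other._value

    def __hash__(self):
        return hash(self._value)
```

Because no arithmetic operators are defined, an expression such as `WideChar(65) + 1` fails with a `TypeError`, which mirrors the intent that code cannot come to depend on the type being an integer or a char before that decision is made.
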
...
I personally think that we may use such type as Pascal wide character
type, but if that is too controversial, then I propose to just add
an interfacing type now and postpone the question of proper Unicode
support.
I'd also postpone it (since we both probably have enough other
things to do first), but implementing such a type (such as, layout
compatible to C's `wchar_t', abstract ordinal type, meaningful name)
might be reasonable now ...
Frank
-- 
Frank Heckenbach, frank@g-n-u.de, http://fjf.gnu.de/, 7977168E
GPC To-Do list, latest features, fixed bugs:
http://www.gnu-pascal.de/todo.html
NEW! GPC download signing key: ACB3 79B2 7EB2 B7A7 EFDE  D101 CD02 4C9D 0FE0 E5E8
