CBFalconer wrote:
Frank Heckenbach wrote:
Waldek Hebisch wrote:
... snip ...
I do not like the concept of "compilation modes". Unicode is not merely a huge collection of code points -- a truly Unicode-aware program is likely to use different algorithms and data structures. Sure, many programs will work without changes, but so will ones using 8-bit bytes and UTF-8.
I agree. All units/libraries would have to exist in both compilation modes (and would need to be tested twice), etc. And, of course, once other issues appear that "want" to have compilation modes, we get to have 2^n copies of libraries, which is even less practical.
Not if the internal char type is 32 bit.
But I disagree already here. Even if we can ignore the (4 times) bigger memory requirement, there is additional conversion to do, and I don't think any plain old text-processing program should endure that ...
Once that decision is made, the only problem is how to supply and describe the file interfaces, and (of secondary importance) the interconnection with C coding. I think we can already eliminate the need for a compact internal char representation for embedded systems, because the fixed overhead is already monstrous.
Not necessarily. As you say, that overhead is fixed, while 32-bit chars make a difference proportional to the size of the data. I.e., you can store only 1/4 of the text data in a given amount of memory. I don't think we can always ignore that -- not on embedded systems, and not even on desktops, which have more memory but may also work on larger amounts of text.
So I think we should (must) leave `Char' as it is. Besides the usual suspects -- binary files and other protocols which depend on data type layout -- changing `Char' would mean breaking most programs that handle text(!) files with 7-bit (i.e. ASCII) and 8-bit charsets. We can't realistically do that.
Binary file interfaces can be simply handled with a subrange of integer, such as "TYPE byte = 0..255;". This means the system has to adjust storage usage to the cardinality of the subranges.
People would have to change all data file types, i.e., a massive breach of backward compatibility. For binary files, this may be arguable, since we've had the same problem with other layout changes. For text files, I think it's just too much of a problem.
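For illustration, a minimal sketch of such a subrange-based binary file interface (program and type names are only examples; the reads and writes deal in raw octets, independently of how many bits a `Char' occupies):

program ByteFileSketch;

{ Sketch only: a binary file built from an integer subrange. }

type
  byte = 0..255;

var
  f: file of byte;     { internal (temporary) file }
  b: byte;

begin
  rewrite (f);
  write (f, 72);       { store raw octets, no character semantics involved }
  write (f, 105);
  reset (f);
  while not eof (f) do
  begin
    read (f, b);
    writeln ('octet: ', b)
  end
end.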
UTF8 need never be an internal format, as long as routines are provided to convert between UTF8 strings and internal Unicode strings.
See Waldek's comments. Using UTF8 internally can indeed be useful.
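For illustration, a rough sketch of one direction of such a conversion -- UTF-8 held in today's 8-bit `Char's decoded into 32-bit code points. The names are made up, the `and'/`or'/`shl' integer operators are a GPC extension, and malformed input is not checked:

program Utf8DecodeSketch;

{ Sketch only: decode a UTF-8 byte sequence stored in ordinary 8-bit
  Chars into 32-bit code points. }

type
  CodePoint = Cardinal;

var
  s: String (64);
  i: Integer;
  cp: CodePoint;

{ Decode the sequence starting at s[i]; return the code point in cp
  and advance i past the sequence. }
procedure DecodeNext (const s: String; var i: Integer; var cp: CodePoint);
var
  b, Extra, k: Integer;
begin
  b := Ord (s[i]);
  if b < $80 then begin cp := b; Extra := 0 end               { ASCII }
  else if b < $E0 then begin cp := b and $1F; Extra := 1 end
  else if b < $F0 then begin cp := b and $0F; Extra := 2 end
  else begin cp := b and $07; Extra := 3 end;
  for k := 1 to Extra do
    cp := (cp shl 6) or (Ord (s[i + k]) and $3F);             { continuation bytes }
  i := i + Extra + 1
end;

begin
  s := 'A' + Chr ($C3) + Chr ($A4);   { 'A', then the UTF-8 form of U+00E4 }
  i := 1;
  while i <= Length (s) do
  begin
    DecodeNext (s, i, cp);
    writeln ('code point: ', cp)
  end
end.

The encoding direction is analogous; a real library routine would additionally have to reject truncated and overlong sequences.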
The amount of storage dedicated to a char need not affect the programs.
You mean the program semantics, which may be true (apart from binary files and other layout-sensitive areas, which are always problematic anyway, of course). But it certainly does affect memory consumption; see above.
I agree that postponement is in order. Full range-checking should have priority. However, things should be done with a view to future paths. This may well include two types of text files, say text and atext (and maybe utext). The narrower forms can be supplied by subrange definitions at the outermost (level 0) scope, making them easily customizable.
I don't think so. Even if you can define the subrange type, and thus a file of this subrange type, on the user level, the `Text' file type has special semantics (in contrast to `file of Char'), and it's not possible to define one's own file type with such semantics. Therefore all required text file types have to be built in. And as discussed with Scott, we'll probably need at least text files with 8-bit coding, UTF-8 coding, and Unicode (little- and big-endian) coding. Whether these will be distinct types or whether the mode can be set (per file, perhaps at binding time), and whether/how to convert automatically when, e.g., reading a Unicode char from an 8-bit file, etc. -- these are further questions which I'd also rather postpone for now ...
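To make the `Text' vs. `file of Char' point concrete, a small sketch (standard behaviour; nothing GPC-specific assumed):

program TextVersusFileOfChar;

{ Sketch only: `Text' carries line structure and implicit conversion
  of non-Char values; a user-declared `file of Char' (or of any Char
  subrange) does not, and there is no way to give it such semantics. }

var
  t: Text;
  f: file of Char;

begin
  rewrite (t);
  writeln (t, 'count = ', 42);   { number converted to characters, line terminated }

  rewrite (f);
  write (f, 'x');                { only Char components are accepted here }
  { writeln (f, 42) would be rejected: no conversion, no line structure }
end.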
Frank