Andreas K. Foerster wrote:
Okay, you might say it is a misuse to use a "string" for UTF-8. But if the text file is in UTF-8 and the display interprets UTF-8, why should I not just make use of that? Why should I make things more complicated than they are?
It's not necessarily more complicated if the compiler or runtime does the necessary conversions. In fact, it gets easier as soon as you use "Length", "Pos/SubStr", etc., as they just work as expected.
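To make that concrete: as long as the UTF-8 bytes live in a plain 8-bit string, "Length" counts bytes, not characters. A small sketch (illustrative only; the umlaut and sharp s are built byte by byte just to keep the source ASCII-clean):

  program LengthDemo;

  var
    S: String (20);

  begin
    { "Gruesse" with u-umlaut and sharp s, stored as UTF-8: 5 characters, 7 bytes }
    S := 'Gr' + Chr ($C3) + Chr ($BC) + Chr ($C3) + Chr ($9F) + 'e';
    WriteLn (Length (S))  { writes 7 (the byte count), not 5 (the character count) }
  end.

With a "wide" character type, or with built-in conversions, you'd get 5 here.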
The GNU coding standards, section 4.1 states:
| The GNU Project regards standards published by other organizations as
| suggestions, not orders. We consider those standards, but we do not obey
| them. In developing a GNU program, you should implement an outside
| standard's specifications when that makes the GNU system better overall
| in an objective sense. When it doesn't, you shouldn't.
That applies to the GNU project in general, but one of GPC's goals (as stated in the first chapter of the manual) is to implement Pascal according to the standards. (But in this case, as I said, it should also make things easier for the programmer, once implemented properly.)
Yes, I just showed you a part of my code. I also use a special type for Unicode characters.
type Unicode = Cardinal; { Large integer value - at least 3 Bytes }
It would probably be similar when built-in (in the future), except that we'd probably use a subrange (0 to $10ffff), so e.g. sets of such a type would be, though still large, at least realistically possible (even if we do not implement sparse sets, which would be another major project of doubtful merit). And, of course, it would then be a Char type, not an integer type.
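For illustration, a rough user-level approximation of such a subrange (the names are only illustrative, and a built-in version would be a real Char type, not an integer subrange as here):

  program SubrangeDemo;

  type
    UnicodeCodePoint = 0 .. $10FFFF;  { covers all Unicode code points }

  var
    c: UnicodeCodePoint;

  begin
    c := $263A;                       { WHITE SMILING FACE }
    WriteLn ('Code point: ', c)       { still an integer type here }
  end.

A "set of" such a subrange would need $110000 bits, i.e. roughly 136 KB as a plain bit set - large, but not impossible.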
type
  UnicodeString = record
    Length: integer;
    content: array [1 .. MaxCharsPerLine] of Unicode
  end;
I fill this string with the UTF-8 decoder I showed you in my last mail.
Because this UnicodeString takes a lot of space, I only keep one line at a time in memory.
Just make it a schema type with "MaxCharsPerLine" as the discriminant. This way you can allocate them as big or small as required. This type is then almost like the EP string type. (And if we build it in in the future, of course, EP strings will be like this automatically, and the string built-ins will just work with it.)
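For concreteness, a sketch of what such a schema type could look like (the field names follow your example above; "Capacity" takes the role of the MaxCharsPerLine discriminant):

  program SchemaDemo;

  type
    Unicode = Cardinal;  { as above }

    { schema type: the discriminant "Capacity" replaces the fixed MaxCharsPerLine }
    UnicodeString (Capacity: Integer) = record
      Length: Integer;
      Content: array [1 .. Capacity] of Unicode
    end;

  var
    Line: ^UnicodeString;

  begin
    New (Line, 80);                    { allocate room for at most 80 characters }
    Line^.Length := 0;
    WriteLn ('Capacity: ', Line^.Capacity);
    Dispose (Line)
  end.

Short lines can then be allocated with a small capacity, which should also ease the memory concern you mention above.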
I didn't think plain ncurses supports multibyte chars (think of counting chars for cursor position etc.), but if you know better, just do it (or describe what needs to be changed) ...
Well, I'm not sure. Is it actually counting chars to get the cursor position? If yes, then it's really a problem.
Sure. I mean, if you output (via CRT or ncurses) a string and then want to retrieve the cursor position, ncurses has to find out this position. OK, it could request it from the terminal (not all terminals support that, but perhaps all that support UTF-8 do). But then imagine you're doing this in a subwindow. ncurses could set the subwindow on the terminal side (if supported), but that would be inefficient if changed often (many escape sequences to send), and it has other disadvantages (suppose you abort or suspend such a program while in a subwindow; then the shell would be restricted to the subwindow as well, until you do "reset"). Or ncurses could try to retrieve the cursor position from the terminal after each character output, to know when to line-wrap (again, many escape sequences to send and data to receive).
It gets worse. ncurses can redraw windows (particularly important for overlapping windows (panels)), so it needs to know which character is where. And there are problems if you output an incomplete UTF-8 sequence: it may leave the terminal in a confused state, mess up a subsequent escape sequence ncurses needs to send, etc. That's why (AFAIK) ncursesw doesn't use UTF-8 in its interface but a "wide char" type, like your "Unicode" type above, so no incomplete UTF-8 sequences can be exchanged between ncurses and the application.
Again, I really think such a "wide" type is easiest for processing (unless you really need to keep huge amounts of text in memory, so space becomes a concern), while UTF-8 is good for storage and I/O (network, terminal, ...). But converting between UTF-8 and "wide chars" is relatively easy, and when done at the right places, it is all that's required, instead of specialized versions of all string processing routines. Treating UTF-8 like an 8-bit char type seems to work well in simpler cases, but gets increasingly difficult and error-prone as requirements grow. (By error-prone I mean that problems due to invalid/incomplete UTF-8 sequences can show up in many places in a program or lead to unexpected behaviour (if the program implicitly relies on their validity), while with a "wide" type, such errors are limited to the conversions, i.e. typically input (file, network, terminal, ...) operations.)
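To illustrate how small the UTF-8 -> "wide char" direction is, here is a sketch (only an illustration, not the decoder from your earlier mail; a real one would also have to reject overlong sequences, surrogates and values above $10FFFF):

  program Utf8Demo;

  const
    MaxCharsPerLine = 256;

  type
    Unicode = Cardinal;

    UnicodeLine = record
      Length: Integer;
      Content: array [1 .. MaxCharsPerLine] of Unicode
    end;

  { Decode the UTF-8 bytes in S into code points in Line.
    Invalid or incomplete sequences become $FFFD (the replacement character). }
  procedure DecodeUtf8 (const S: String; var Line: UnicodeLine);
  var
    i, b, Needed: Integer;
    cp: Unicode;
    Valid: Boolean;
  begin
    Line.Length := 0;
    i := 1;
    while (i <= Length (S)) and (Line.Length < MaxCharsPerLine) do
    begin
      b := Ord (S[i]);
      i := i + 1;
      if      b <  $80 then begin cp := b;        Needed := 0 end  { ASCII }
      else if b >= $F0 then begin cp := b mod 8;  Needed := 3 end  { 4-byte sequence }
      else if b >= $E0 then begin cp := b mod 16; Needed := 2 end  { 3-byte sequence }
      else if b >= $C0 then begin cp := b mod 32; Needed := 1 end  { 2-byte sequence }
      else                  begin cp := $FFFD;    Needed := 0 end; { stray continuation byte }
      Valid := True;
      while (Needed > 0) and Valid do
        if (i <= Length (S)) and (Ord (S[i]) >= $80) and (Ord (S[i]) < $C0) then
        begin
          cp := cp * 64 + Ord (S[i]) mod 64;  { append six more bits }
          i := i + 1;
          Needed := Needed - 1
        end
        else
          Valid := False;                     { incomplete/invalid sequence }
      if not Valid then
        cp := $FFFD;
      Line.Length := Line.Length + 1;
      Line.Content[Line.Length] := cp
    end
  end;

  var
    Line: UnicodeLine;
    k: Integer;

  begin
    { "Gruesse" with u-umlaut and sharp s, as UTF-8 bytes }
    DecodeUtf8 ('Gr' + Chr ($C3) + Chr ($BC) + Chr ($C3) + Chr ($9F) + 'e', Line);
    for k := 1 to Line.Length do
      Write (Line.Content[k], ' ');   { 71 114 252 223 101 }
    WriteLn
  end.

The reverse direction (encoding) is just as short, and as said above, with the conversions confined to I/O, all the usual string handling can stay oblivious to UTF-8.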
Frank