Andreas K. Foerster wrote:
Okay, you might say it is a misuse to use a "string" for UTF-8. But if the text file is in UTF-8 and the display interprets UTF-8, why should I not just make use of that? Why should I make things more complicated than they are?
It's not necessarily more complicated if the compiler or runtime does the necessary conversions. In fact, it gets easier as soon as you use "Length", "Pos/SubStr", etc., as they just work as expected.
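To make that concrete: as long as the UTF-8 bytes live in a plain 8-bit string, "Length" counts bytes, not characters. A small sketch (illustrative only; the umlaut and sharp s are built byte by byte just to keep the source ASCII-clean):

  program LengthDemo;

  var
    S: String (20);

  begin
    { "Gruesse" with u-umlaut and sharp s, stored as UTF-8: 5 characters, 7 bytes }
    S := 'Gr' + Chr ($C3) + Chr ($BC) + Chr ($C3) + Chr ($9F) + 'e';
    WriteLn (Length (S))  { writes 7 (the byte count), not 5 (the character count) }
  end.

With a "wide" character type, or with built-in conversions, you'd get 5 here.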
The GNU coding standards, section 4.1 states:
| The GNU Project regards standards published by other organizations as
| suggestions, not orders. We consider those standards, but we do not obey
| them. In developing a GNU program, you should implement an outside
| standard's specifications when that makes the GNU system better overall
| in an objective sense. When it doesn't, you shouldn't.
That applies to the GNU project in general, but one of GPC's goals (as stated in the first chapter of the manual) is to implement Pascal according to the standards. (But in this case, as I said, it should also make things easier for the programmer, once implemented properly.)
Yes, I just showed you a part of my code. I also use a special type for Unicode characters.
type Unicode = Cardinal; { Large integer value - at least 3 Bytes }
It would probably be similar when built-in (in the future), except that we'd probably use a subrange (0 to $10ffff), so e.g. sets of such a type would be, though still large, at least realistically possible (even if we do not implement sparse sets, which would be another major project of doubtful merit). And, of course, it would then be a Char type, not an integer type.
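For illustration, a rough user-level approximation of such a subrange (the names are only illustrative, and a built-in version would be a real Char type, not an integer subrange as here):

  program SubrangeDemo;

  type
    UnicodeCodePoint = 0 .. $10FFFF;  { covers all Unicode code points }

  var
    c: UnicodeCodePoint;

  begin
    c := $263A;                       { WHITE SMILING FACE }
    WriteLn ('Code point: ', c)       { still an integer type here }
  end.

A "set of" such a subrange would need $110000 bits, i.e. roughly 136 KB as a plain bit set - large, but not impossible.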
type
  UnicodeString = record
    Length: integer;
    content: array [1 .. MaxCharsPerLine] of Unicode
  end;
I fill this string with the UTF-8 decoder I showed you in my last mail.
Because this UnicodeString takes a lot of space, I only keep one line at a time in memory.
Just make it a schema type with "MaxCharsPerLine" as the discriminant. This way you can allocate them as big or small as required. This type is then almost like the EP string type. (And if we build it in in the future, of course, EP strings will be like this automatically, and the string built-ins will just work with it.)
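For concreteness, a sketch of what such a schema type could look like (the field names follow your example above; "Capacity" takes the role of the MaxCharsPerLine discriminant):

  program SchemaDemo;

  type
    Unicode = Cardinal;  { as above }

    { schema type: the discriminant "Capacity" replaces the fixed MaxCharsPerLine }
    UnicodeString (Capacity: Integer) = record
      Length: Integer;
      Content: array [1 .. Capacity] of Unicode
    end;

  var
    Line: ^UnicodeString;

  begin
    New (Line, 80);                    { allocate room for at most 80 characters }
    Line^.Length := 0;
    WriteLn ('Capacity: ', Line^.Capacity);
    Dispose (Line)
  end.

Short lines can then be allocated with a small capacity, which should also ease the memory concern you mention above.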
I didn't think plain ncurses supports multibyte chars (think of counting chars for cursor position etc.), but if you know better, just do it (or describe what needs to be changed) ...
Well, I'm not sure. Is it actually counting chars to get the cursor position? If yes, then it's really a problem.
Sure. I mean, if you output (via CRT or ncurses) a string and then want to retrieve the cursor position, ncurses has to find out this position. OK, it could request it from the terminal (not all terminals support that, but perhaps all that support UTF-8 do). But then imagine you're doing this in a subwindow. ncurses could set the subwindow on the terminal side (if supported), but that would be inefficient if changed often (many escape sequences to send), and it has other disadvantages (suppose you abort or suspend such a program while in a subwindow; then the shell would be restricted to the subwindow as well, until you do "reset"). Or ncurses could try to retrieve the cursor position from the terminal after each character output, to know when to line-wrap (again, many escape sequences to send and data to receive).
It gets worse. ncurses can redraw windows (particularly important for overlapping windows (panels)), so it needs to know which character is where. And there are problems if you output an incomplete UTF-8 sequence: it may leave the terminal in a confused state, mess up a subsequent escape sequence ncurses needs to send, etc. That's why (AFAIK) ncursesw doesn't use UTF-8 in its interface but a "wide char" type, like your "Unicode" type above, so no incomplete UTF-8 sequences can be exchanged between ncurses and the application.
Again, I really think such a "wide" type is easiest for processing (unless you really need to keep huge amounts of text in memory, so space becomes a concern), while UTF-8 is good for storage and I/O (network, terminal, ...). But converting between UTF-8 and "wide chars" is relatively easy, and when done at the right places, it is all that's required, instead of specialized versions of all string processing routines. Treating UTF-8 like an 8-bit char type seems to work well in simpler cases, but gets increasingly difficult and error-prone as requirements grow. (By error-prone I mean that problems due to invalid/incomplete UTF-8 sequences can show up in many places in a program or lead to unexpected behaviour (if the program implicitly relies on their validity), while with a "wide" type, such errors are limited to the conversions, i.e. typically input (file, network, terminal, ...) operations.)
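To illustrate how small the UTF-8 -> "wide char" direction is, here is a sketch (only an illustration, not the decoder from your earlier mail; a real one would also have to reject overlong sequences, surrogates and values above $10FFFF):

  program Utf8Demo;

  const
    MaxCharsPerLine = 256;

  type
    Unicode = Cardinal;

    UnicodeLine = record
      Length: Integer;
      Content: array [1 .. MaxCharsPerLine] of Unicode
    end;

  { Decode the UTF-8 bytes in S into code points in Line.
    Invalid or incomplete sequences become $FFFD (the replacement character). }
  procedure DecodeUtf8 (const S: String; var Line: UnicodeLine);
  var
    i, b, Needed: Integer;
    cp: Unicode;
    Valid: Boolean;
  begin
    Line.Length := 0;
    i := 1;
    while (i <= Length (S)) and (Line.Length < MaxCharsPerLine) do
    begin
      b := Ord (S[i]);
      i := i + 1;
      if      b <  $80 then begin cp := b;        Needed := 0 end  { ASCII }
      else if b >= $F0 then begin cp := b mod 8;  Needed := 3 end  { 4-byte sequence }
      else if b >= $E0 then begin cp := b mod 16; Needed := 2 end  { 3-byte sequence }
      else if b >= $C0 then begin cp := b mod 32; Needed := 1 end  { 2-byte sequence }
      else                  begin cp := $FFFD;    Needed := 0 end; { stray continuation byte }
      Valid := True;
      while (Needed > 0) and Valid do
        if (i <= Length (S)) and (Ord (S[i]) >= $80) and (Ord (S[i]) < $C0) then
        begin
          cp := cp * 64 + Ord (S[i]) mod 64;  { append six more bits }
          i := i + 1;
          Needed := Needed - 1
        end
        else
          Valid := False;                     { incomplete/invalid sequence }
      if not Valid then
        cp := $FFFD;
      Line.Length := Line.Length + 1;
      Line.Content[Line.Length] := cp
    end
  end;

  var
    Line: UnicodeLine;
    k: Integer;

  begin
    { "Gruesse" with u-umlaut and sharp s, as UTF-8 bytes }
    DecodeUtf8 ('Gr' + Chr ($C3) + Chr ($BC) + Chr ($C3) + Chr ($9F) + 'e', Line);
    for k := 1 to Line.Length do
      Write (Line.Content[k], ' ');   { 71 114 252 223 101 }
    WriteLn
  end.

The reverse direction (encoding) is just as short, and as said above, with the conversions confined to I/O, all the usual string handling can stay oblivious to UTF-8.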
Frank