Hello,
there is some trouble when the display is set to UTF-8 and my program uses the CRT unit. Some characters are just shown as several spaces, or they even scramble the display.
For example, I cannot use the German letter sharp S #$C3#$9F (U+00DF) or the German quotation marks #$E2#$80#$9E ... #$E2#$80#$9C (U+201E ... U+201C).
It works fine without the CRT unit. It also works fine when I compile with FreePascal.
I've tried the command "SetPCCharset(false);", but that didn't help. Is there anything else I could try?
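For illustration, a minimal program of the kind that shows the problem here (just a sketch):

program CrtUtf8Test;
uses CRT;
begin
  { "Straße" in UTF-8; without "uses CRT" this displays correctly }
  WriteLn('Stra'#$C3#$9F'e')
end.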
Andreas K. Foerster wrote:
> there is some trouble when the display is set to UTF-8 and my program uses the CRT unit. Some characters are just shown as several spaces, or they even scramble the display.
> For example, I cannot use the German letter sharp S #$C3#$9F (U+00DF) or the German quotation marks #$E2#$80#$9E ... #$E2#$80#$9C (U+201E ... U+201C).
> It works fine without the CRT unit. It also works fine when I compile with FreePascal.
> I've tried the command "SetPCCharset(false);", but that didn't help. Is there anything else I could try?
Sorry, GPC doesn't yet support larger-than-8-bit chars at all.
(Without CRT, it may seem to work, but actually UTF-8 doesn't fulfill the requirements of the "Char" type, so there are subtle differences. See previous discussions on the list.)
For CRT, it would probably require using ncursesw instead of ncurses.
Frank
On Saturday, 17 Feb 2007, Frank Heckenbach wrote:
> Sorry, GPC doesn't yet support larger-than-8-bit chars at all.
> (Without CRT, it may seem to work, but actually UTF-8 doesn't fulfill the requirements of the "Char" type, so there are subtle differences. See previous discussions on the list.)
I think it shouldn't be complicated to keep at least some basic compatibility with UTF-8. In contrast to UTF-16, UTF-8 has some nice properties.
First of all, 7-bit ASCII chars are stored unchanged in one single byte. Since all Pascal keywords and identifiers are 7-bit ASCII, the compiler has no problem at all when you save your source code in UTF-8. The only multibyte characters are then in comments and in strings, where you can make use of them.
Another nice property of UTF-8 is that all bytes which make up a multibyte character are >= #128. So the only thing you have to do, IMHO, is to make sure that all characters >= #128 are passed through unchanged...
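As a small illustration (my own sketch, not from AKFQuiz): since every byte of a multibyte sequence is >= #128, an 8-bit-clean output loop keeps UTF-8 intact:

program PassThrough;
var
  s: string(20);   { GPC string schema; use string[20] with FPC }
  i: integer;
begin
  s := 'Stra'#$C3#$9F'e';   { "Straße" as UTF-8 bytes }
  for i := 1 to length(s) do
    write(s[i]);            { each byte >= #128 passed through unchanged }
  writeln
end.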
Yes, I know what I'm talking about. My project AKFQuiz can handle quiz files encoded in UTF-8.
The graphical program uses a charset with more than 500 characters (see screenshot). The line-oriented variant linequiz can handle UTF-8 displays. Only the screen-oriented program scrquiz has trouble when I compile it with GPC (while it works fine when compiled with FPC).
> For CRT, it would probably require using ncursesw instead of ncurses.
I don't think that this is necessary.
> Sorry, GPC doesn't yet support larger-than-8-bit chars at all.
Here is a small extract from my AKFQuiz project (qsys.pas). Maybe you want to use it for GPC. (GPLv2 or later)
-----------------------------------------------------------------------------
function EncodeUTF8(u: Unicode): string80;
begin
  case u of
    $000000..$00007F:
      EncodeUTF8 := chr(u);
    $000080..$0007FF:
      EncodeUTF8 := chr($C0 or (u shr 6)) +
                    chr($80 or (u and $3F));
    $000800..$00FFFF:
      EncodeUTF8 := chr($E0 or (u shr (2*6))) +
                    chr($80 or ((u shr 6) and $3F)) +
                    chr($80 or (u and $3F));
    $010000..$1FFFFF:
      EncodeUTF8 := chr($F0 or (u shr (3*6))) +
                    chr($80 or ((u shr (2*6)) and $3F)) +
                    chr($80 or ((u shr 6) and $3F)) +
                    chr($80 or (u and $3F));
    otherwise
      EncodeUTF8 := chr(unknownChar)
  end
end;
{ Gets the Unicode value at the specified position in the UTF-8 string;
  the position will be set to the next char.
  RFC 3629, ISO 10646 }
function getUTF8Char(const s: mystring; var p: integer): Unicode;
var u: Unicode;
begin
  getUTF8Char := unknownChar;
  u := unknownChar;

  { Attention: do not use this decoder in security-critical areas;
    it also decodes invalid UTF-8 encodings, which can be exploited. }

  if (s = '') or (p > length(s)) then exit;
  { skip follow-up bytes; note the bounds check comes first,
    so s[p] is never read past the end of the string }
  while (p <= length(s)) and ((ord(s[p]) and $C0) = $80) do
    inc(p);
  if p > length(s) then exit;
  case ord(s[p]) of
    $00..$7F :                 { 1 byte encoding }
      begin
        u := ord(s[p]);
        inc(p)
      end;
    $C2..$DF :                 { 2 byte encoding }
      begin
        u := (ord(s[p]) and $1F) shl 6;
        inc(p);
        u := u or (ord(s[p]) and $3F);
        inc(p)
      end;
    $E0..$EF :                 { 3 byte encoding }
      begin
        u := (ord(s[p]) and $0F) shl (2*6);
        inc(p);
        u := u or ((ord(s[p]) and $3F) shl 6);
        inc(p);
        u := u or (ord(s[p]) and $3F);
        inc(p)
      end;
    $F0..$F7 :                 { 4 byte encoding }
      begin
        u := (ord(s[p]) and $07) shl (3*6);
        inc(p);
        u := u or ((ord(s[p]) and $3F) shl (2*6));
        inc(p);
        u := u or ((ord(s[p]) and $3F) shl 6);
        inc(p);
        u := u or (ord(s[p]) and $3F);
        inc(p)
      end;
    otherwise
      inc(p)                   { skip unknown char anyway }
  end; { case }

  getUTF8Char := u
end;
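To show how the two fit together, here is a small usage fragment (only a sketch; Unicode, mystring, string80 and unknownChar are declared elsewhere in qsys.pas):

var
  s: mystring;
  p: integer;
  u: Unicode;
begin
  s := EncodeUTF8($00DF);    { "ß" -> the two bytes #$C3 #$9F }
  p := 1;
  u := getUTF8Char(s, p)     { u = $00DF, p now points past the char }
end;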
Andreas K. Foerster wrote:
>> For CRT, it would probably require using ncursesw instead of ncurses.
> I don't think that this is necessary.
>> Sorry, GPC doesn't yet support larger-than-8-bit chars at all.
> Here is a small extract from my AKFQuiz project (qsys.pas). Maybe you want to use it for GPC. (GPLv2 or later)
I don't have a problem with conversion routines. The problem is that UTF-8 strings don't satisfy the requirements of Pascal "Char" and "String" types (e.g., that every sequence of valid "Char" values is a valid string value; as I said, see previous discussions). Since GPC aims to support Pascal (plus extensions), not some language-more-or-less-similar-to-Pascal, I think we'll need a "Char" type larger than 8 bits for proper Unicode support. (And, BTW, processing on such a type is still easier than UTF-8. The main advantage of UTF-8 is its reduced space requirement, so converting on I/O should also be the least work for the programmer. When implemented properly in this way, most Pascal programs using strings should work with Unicode without changes, unless they specifically rely on properties of ASCII, ISO-8859-x, etc.)
In the meantime, you can try to make things work with UTF-8. Of course, not everything will work, e.g. "Length" on a UTF-8 string will produce wrong results, and I suppose you know that and you already use workarounds in your code.
As for CRT, if you know what's necessary to work with such UTF-8 pseudo-strings, just make the required changes (it's free software ;-). I don't think plain ncurses supports multibyte chars (think of counting chars for cursor position etc.), but if you know better, just do it (or describe what needs to be changed) ...
Frank
On Saturday, 17 Feb 2007, Frank Heckenbach wrote:
> I don't have a problem with conversion routines. The problem is that UTF-8 strings don't satisfy the requirements of Pascal "Char" and "String" types (e.g., that every sequence of valid "Char" values is a valid string value; as I said, see previous discussions). Since GPC aims to support Pascal (plus extensions), not some language-more-or-less-similar-to-Pascal,
Okay, you might say it is a misuse to use a "string" for UTF-8. But if the text file is in UTF-8 and the display interprets UTF-8, why should I not just make use of that? Why should I make things more complicated than they are? Just to obey a standard?
The GNU coding standards, section 4.1 states:
| The GNU Project regards standards published by other organizations as
| suggestions, not orders. We consider those standards, but we do not obey
| them. In developing a GNU program, you should implement an outside
| standard's specifications when that makes the GNU system better overall
| in an objective sense. When it doesn't, you shouldn't.
> I think we'll need a "Char" type larger than 8 bits for proper Unicode support. (And, BTW, processing on such a type is still easier than UTF-8. The main advantage of UTF-8 is its reduced space requirement, so converting on I/O should also be the least work for the programmer. When implemented properly in this way, most Pascal programs using strings should work with Unicode without changes, unless they specifically rely on properties of ASCII, ISO-8859-x, etc.)
Yes, I just showed you a part of my code. I also use a special type for Unicode characters.

type Unicode = Cardinal; { large integer value - at least 3 bytes }

type UnicodeString = record
       Length: integer;
       content: array [1 .. MaxCharsPerLine] of Unicode
     end;
I fill this string with the UTF-8 decoder I showed you in my last mail.
Because this UnicodeString takes a lot of space, I keep only one line at a time in memory.
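For illustration, filling such a string with the decoder could look like this (a sketch, assuming getUTF8Char from my last mail and the declarations above):

procedure DecodeUTF8Line(const s: mystring; var line: UnicodeString);
var p: integer;
begin
  line.Length := 0;
  p := 1;
  { getUTF8Char advances p past each UTF-8 sequence }
  while (p <= length(s)) and (line.Length < MaxCharsPerLine) do
    begin
      inc(line.Length);
      line.content[line.Length] := getUTF8Char(s, p)
    end
end;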
> In the meantime, you can try to make things work with UTF-8. Of course, not everything will work, e.g. "Length" on a UTF-8 string will produce wrong results, and I suppose you know that and you already use workarounds in your code.
Yes.
function UTF8Length(const s: string): LongInt;
var i, res: LongInt;
begin
  res := 0;
  { count ASCII bytes and start bytes, ignore the rest }
  for i := 1 to length(s) do
    if (ord(s[i]) <= 127) or (ord(s[i]) >= $C0) then
      inc(res);
  UTF8Length := res
end;
This counts how many characters are in the string. Whether these characters actually occupy a screen column or are combining characters is a different question.
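For example (a quick sketch using UTF8Length above):

  writeln(UTF8Length('Stra'#$C3#$9F'e'));  { 6 - "Straße" has 6 characters }
  writeln(length('Stra'#$C3#$9F'e'));      { 7 - Length counts bytes }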
> As for CRT, if you know what's necessary to work with such UTF-8 pseudo-strings, just make the required changes (it's free software ;-).
Okay, I must admit that I have trouble reading the code. It's a mix of two languages, and the code is very sparsely commented... But I'm going to have a deeper look when I have the time. (At the moment I'm working on a new project, which I haven't published yet.)
> I don't think plain ncurses supports multibyte chars (think of counting chars for cursor position etc.), but if you know better, just do it (or describe what needs to be changed) ...
Well, I'm not sure. Does it actually count chars to get the cursor position? If so, then it's really a problem.
Andreas K. Foerster wrote:
> Okay, you might say it is a misuse to use a "string" for UTF-8. But if the text file is in UTF-8 and the display interprets UTF-8, why should I not just make use of that? Why should I make things more complicated than they are?
It's not necessarily more complicated if the compiler or runtime does the necessary conversions. In fact, it gets easier as soon as you use "Length", "Pos/SubStr", etc., as they just work as expected.
> The GNU coding standards, section 4.1 states:
> | The GNU Project regards standards published by other organizations as
> | suggestions, not orders. We consider those standards, but we do not obey
> | them. In developing a GNU program, you should implement an outside
> | standard's specifications when that makes the GNU system better overall
> | in an objective sense. When it doesn't, you shouldn't.
That applies to the GNU project in general, but one of GPC's goals, as stated in the first chapter of the manual, is to implement Pascal according to the standards. (But in this case, as I said, it should also make things easier for the programmer, once implemented properly.)
> Yes, I just showed you a part of my code. I also use a special type for Unicode characters.
> type Unicode = Cardinal; { large integer value - at least 3 bytes }
It would probably be similar when built-in (in the future), except that we'd probably use a subrange (0 to $10ffff), so e.g. sets of such a type would be, though still large, at least realistically possible (even if we do not implement sparse sets, which would be another major project of doubtful merit). And, of course, it would then be a Char type, not an integer type.
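Roughly like this (a sketch of the subrange idea; of course the real thing would be a Char type, not an integer subrange as written here):

type
  UnicodeChar = 0 .. $10FFFF;           { subrange over all code points }
  UnicodeCharSet = set of UnicodeChar;  { large, but realistically possible }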
> type UnicodeString = record
>        Length: integer;
>        content: array [1 .. MaxCharsPerLine] of Unicode
>      end;
> I fill this string with the UTF-8 decoder I showed you in my last mail.
> Because this UnicodeString takes a lot of space, I keep only one line at a time in memory.
Just make it a schema type with "MaxCharsPerLine" as the discriminant. This way you can allocate them as big or small as required. This type is then almost like the EP string type. (And if we build it into GPC in the future, of course, EP strings will be like this automatically, and the string built-ins will just work with it.)
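For instance (a sketch; "Capacity" is just a name I'm picking for the discriminant):

program SchemaDemo;
type
  Unicode = Cardinal;
  UnicodeString (Capacity: Integer) = record
    Length: Integer;
    content: array [1 .. Capacity] of Unicode
  end;
var
  Line: ^UnicodeString;
begin
  New(Line, 80);     { allocate as big or small as required }
  Line^.Length := 0;
  Dispose(Line)
end.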
>> I don't think plain ncurses supports multibyte chars (think of counting chars for cursor position etc.), but if you know better, just do it (or describe what needs to be changed) ...
> Well, I'm not sure. Does it actually count chars to get the cursor position? If so, then it's really a problem.
Sure. I mean, if you output (via CRT or ncurses) a string and then want to retrieve the cursor position, ncurses has to find out this position. OK, it could request it from the terminal (not all terminals support that, but perhaps all that support UTF-8 do). But then imagine you're doing this in a subwindow. ncurses could set the subwindow on the terminal side (if supported), but that would be inefficient if changed often (many escape sequences to send), and it has other disadvantages (suppose you abort or suspend such a program while in a subwindow; then the shell would be restricted to the subwindow as well, until you do "reset"). Or ncurses could try to retrieve the cursor position from the terminal after each character output, to know when to line-wrap (again, many escape sequences to send and data to receive).
It gets worse. ncurses can redraw windows (particularly important for overlapping windows (panels)), so it needs to know which character is where. And there are problems if you output an incomplete UTF-8 sequence: it may leave the terminal in a confused state, or mess up a subsequent escape sequence ncurses needs to send, etc. That's why (AFAIK) ncursesw doesn't use UTF-8 in its interface but a "wide char" type, like your "Unicode" type above, so there can't be incomplete UTF-8 sequences exchanged between ncurses and the application.
Again, I really think such a "wide" type is easiest for processing (unless you really need to keep huge amounts of text in memory, so that space becomes a concern), while UTF-8 is good for storage and I/O (network, terminal, ...). But converting between UTF-8 and "wide chars" is relatively easy, and when done at the right places, it is all that's required, instead of specialized versions of all string processing routines. Treating UTF-8 like an 8-bit char type seems to work well in simpler cases, but gets increasingly difficult and error-prone as requirements grow. (By error-prone I mean that problems due to invalid or incomplete UTF-8 sequences can show up in many places in a program or lead to unexpected behaviour (if the program implicitly relies on their validity), while with a "wide" type, such errors are limited to the conversions, i.e. typically input (file, network, terminal, ...) operations.)
Frank