On Saturday, 17 Feb 2007, Frank Heckenbach wrote:
> Sorry, GPC doesn't support larger-than-8-bit chars yet at all.
> (Without CRT, it may seem to work, but actually UTF-8 doesn't fulfill the requirements of the "Char" type, so there are subtle differences. See previous discussions on the list.)
I think it shouldn't be complicated to keep at least some basic compatibility with UTF-8. Unlike UTF-16, UTF-8 has some nice features.
First of all, 7-bit ASCII characters are stored unchanged in a single byte. Since all Pascal keywords and identifiers are 7-bit ASCII, the compiler has no problem at all when you save your source code in UTF-8. The only multibyte characters then occur in comments and in strings, where you can make use of them.
Another nice feature of UTF-8 is that all bytes which make up a multibyte character are >=#128. So the only thing you have to do, imho, is to make sure that all characters >=#128 are passed through unchanged...
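As a rough illustration of that property (this is not code from GPC or AKFQuiz; the program name and the helper IsPlainAscii are made up for the example), the following sketch builds a short UTF-8 string from raw bytes and counts how many of its bytes are below #128 and how many are >=#128:

```pascal
program Utf8Bytes;
{ Illustration only: Utf8Bytes and IsPlainAscii are hypothetical names,
  not part of GPC, FPC or AKFQuiz. }

var
  s: string[80];
  i, nAscii, nHigh: integer;

{ true for 7-bit ASCII, i.e. a byte that is a complete character }
function IsPlainAscii(c: char): boolean;
begin
  IsPlainAscii := ord(c) < 128
end;

begin
  { 'a', U+00E4 (2 bytes: C3 A4), U+20AC (3 bytes: E2 82 AC), 'z' }
  s := 'a' + chr($C3) + chr($A4) + chr($E2) + chr($82) + chr($AC) + 'z';

  nAscii := 0;
  nHigh := 0;
  for i := 1 to length(s) do
    if IsPlainAscii(s[i]) then
      inc(nAscii)
    else
      inc(nHigh);

  { only 'a' and 'z' are below #128; every byte of the two
    multibyte characters is >= #128 }
  writeln('bytes <  128: ', nAscii);
  writeln('bytes >= 128: ', nHigh)
end.
```

For this string the program counts 2 bytes below #128 and 5 bytes >=#128, so code that only gives special meaning to ASCII bytes never sees a false match inside a multibyte character.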
Yes, I know what I'm talking about. My project AKFQuiz can handle quiz files encoded in UTF-8.
The graphical program uses a charset with more than 500 characters (see screenshot). The line-oriented variant linequiz can handle UTF-8 displays. Only the screen-oriented program scrquiz has trouble when I compile it with GPC (while it works fine when compiled with FPC).
> For CRT, it would probably require using ncursesw instead of ncurses.
I don't think that this is necessary.
> Sorry, GPC doesn't support larger-than-8-bit chars yet at all.
Here is a small extract from my AKFQuiz project (qsys.pas). Maybe you want to use it for GPC. (GPLv2 or later)
-----------------------------------------------------------------------------
function EncodeUTF8(u: Unicode): string80;
begin
case u of
  $000000..$00007F : EncodeUTF8 := chr(u);
  $000080..$0007FF : EncodeUTF8 := chr($C0 or (u shr 6)) +
                                   chr($80 or (u and $3F));
  $000800..$00FFFF : EncodeUTF8 := chr($E0 or (u shr (2*6))) +
                                   chr($80 or ((u shr 6) and $3F)) +
                                   chr($80 or (u and $3F));
  $010000..$1FFFFF : EncodeUTF8 := chr($F0 or (u shr (3*6))) +
                                   chr($80 or ((u shr (2*6)) and $3F)) +
                                   chr($80 or ((u shr 6) and $3F)) +
                                   chr($80 or (u and $3F));
  otherwise EncodeUTF8 := chr(unknownChar)
  end
end;

{ gets the Unicode value at the specified position in the UTF-8 string
  the position will be set to the next char
  RFC 3629, ISO 10646 }
function getUTF8Char(const s: mystring; var p: integer): Unicode;
var u : Unicode;
begin
getUTF8Char := unknownChar;
u := unknownChar;

{ attention: do not use this decoder in security critical areas
  it also decodes invalid UTF-8 encodings, which can be exploited }

if (s='') or (p>length(s)) then exit;

{ skip followup-bytes }
{ the bound is tested first, so that with short-circuit evaluation
  s[p] is never read past the end of the string }
while (p<=length(s)) and ((ord(s[p]) and $C0)=$80) do inc(p);
if p>length(s) then exit;

case ord(s[p]) of
  $00..$7F : begin { 1 byte encoding }
             u := ord(s[p]);
             inc(p, 1)
             end;
  $C2..$DF : begin { 2 byte encoding }
             u := (ord(s[p]) and $1F) shl 6;
             inc(p);
             u := u or (ord(s[p]) and $3F);
             inc(p)
             end;
  $E0..$EF : begin { 3 byte encoding }
             u := (ord(s[p]) and $0F) shl (2*6);
             inc(p);
             u := u or ((ord(s[p]) and $3F) shl 6);
             inc(p);
             u := u or (ord(s[p]) and $3F);
             inc(p)
             end;
  $F0..$F7 : begin { 4 byte encoding }
             u := (ord(s[p]) and $07) shl (3*6);
             inc(p);
             u := u or ((ord(s[p]) and $3F) shl (2*6));
             inc(p);
             u := u or ((ord(s[p]) and $3F) shl 6);
             inc(p);
             u := u or (ord(s[p]) and $3F);
             inc(p)
             end;
  otherwise inc(p) { skip unknown char anyway }
  end; { case }

getUTF8Char := u
end;