Re: CRT unit and UTF-8 displays

17 Feb 2007


On Saturday, 17 Feb 2007, Frank Heckenbach wrote:
...
Sorry, GPC doesn't support larger-than-8-bit chars yet at all.
(Without CRT, it may seem to work, but actually UTF-8 doesn't
fulfill the requirements of the "Char" type, so there are subtle
differences. See previous discussions on the list.)
I think it shouldn't be complicated to keep at least some basic
compatibility with UTF-8. In contrast to UTF-16, UTF-8 has some nice
features.
First of all, 7-bit ASCII characters are stored unchanged, one byte each.
Since all Pascal keywords and identifiers are 7-bit ASCII, the compiler
has no problem at all when you save your source code in UTF-8. The only
multibyte characters are then in comments and in strings, where you can
make use of them.
Another nice feature of UTF-8 is that all bytes which make up a
multibyte character are >= #128. So the only thing you have to do, IMHO,
is to make sure that all characters >= #128 are passed through unchanged...
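The pass-through idea can be sketched as follows. This is only an illustration, not code from GPC or AKFQuiz: the type Str80 and the helper AsciiUpper are hypothetical names. The routine changes only the 7-bit ASCII letters a..z and copies every byte >= #128 as it is, so the bytes of a multibyte character are never touched.

```pascal
program PassThrough;

type
  Str80 = string[80];  { hypothetical helper type for this sketch }

{ Upper-cases only the 7-bit ASCII letters a..z.
  Every other byte, including all bytes >= #128, is copied
  unchanged, so UTF-8 multibyte sequences stay intact.
  AsciiUpper is a made-up name, not part of GPC or AKFQuiz. }
function AsciiUpper(const s: Str80): Str80;
var
  i: integer;
  r: Str80;
begin
  r := s;
  for i := 1 to length(r) do
    if (r[i] >= 'a') and (r[i] <= 'z') then
      r[i] := chr(ord(r[i]) - 32);
  AsciiUpper := r
end;

begin
  { 'gr', then U+00FC encoded as the bytes $C3 $BC, then 'n' }
  writeln(AsciiUpper('gr' + chr($C3) + chr($BC) + 'n'))
end.
```

The two bytes $C3 $BC come out exactly as they went in; only the surrounding ASCII letters are changed.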
Yes, I know what I'm talking about. My project AKFQuiz can handle
quiz files encoded in UTF-8.
http://akfquiz.nongnu.org/
The graphical program uses a character set with more than 500 characters
(see screenshot). The line-oriented variant, linequiz, can handle UTF-8
displays. Only the screen-oriented program, scrquiz, has trouble when I
compile it with GPC (while it works fine when compiled with FPC).
...
For CRT, it would probably require using ncursesw instead of
ncurses.
I don't think that this is necessary.
...
Sorry, GPC doesn't support larger-than-8-bit chars yet at all.
Here is a small extract from my AKFQuiz project (qsys.pas).
Maybe you want to use it for GPC. (GPLv2 or later)
-----------------------------------------------------------------------------
function EncodeUTF8(u: Unicode): string80;
begin
case u of
  $000000..$00007F : EncodeUTF8 := chr(u);
  $000080..$0007FF : EncodeUTF8 := chr($C0 or (u shr 6)) +
                                   chr($80 or (u and $3F));
  $000800..$00FFFF : EncodeUTF8 := chr($E0 or (u shr (2*6))) +
                                   chr($80 or ((u shr 6) and $3F)) +
                                   chr($80 or (u and $3F));
  $010000..$1FFFFF : EncodeUTF8 := chr($F0 or (u shr (3*6))) +
                                   chr($80 or ((u shr (2*6)) and $3F)) +
                                   chr($80 or ((u shr 6) and $3F)) +
                                   chr($80 or (u and $3F));
  otherwise EncodeUTF8 := chr(unknownChar)
  end
end;
{ returns the Unicode value at the given position in the UTF-8 string;
  the position is advanced to the next character
  RFC 3629, ISO 10646 }
function getUTF8Char(const s: mystring; var p: integer): Unicode;
var u : Unicode;
begin
getUTF8Char := unknownChar;
u := unknownChar;
{ attention: do not use this decoder in security-critical areas;
  it also decodes invalid UTF-8 encodings, which can be exploited }
if (s='') or (p>length(s)) then exit;
{ skip followup-bytes }
{ check the bounds first, so s[p] is never read past the end }
while (p<=length(s)) and ((ord(s[p]) and $C0)=$80) do inc(p);
if p>length(s) then exit;
case ord(s[p]) of
     $00..$7F : begin { 1 byte encoding }
                u := ord(s[p]);
                inc(p, 1)
                end;
     $C2..$DF : begin { 2 byte encoding }
                u := (ord(s[p]) and $1F) shl 6;
                inc(p);
                u := u or (ord(s[p]) and $3F);
                inc(p)
                end;
     $E0..$EF : begin { 3 byte encoding }
                u := (ord(s[p]) and $0F) shl (2*6);
                inc(p);
                u := u or ((ord(s[p]) and $3F) shl 6);
                inc(p);
                u := u or (ord(s[p]) and $3F);
                inc(p);
                end;
     $F0..$F7 : begin { 4 byte encoding }
                u := (ord(s[p]) and $07) shl (3*6);
                inc(p);
                u := u or ((ord(s[p]) and $3F) shl (2*6));
                inc(p);
                u := u or ((ord(s[p]) and $3F) shl 6);
                inc(p);
                u := u or (ord(s[p]) and $3F);
                inc(p)
                end;
     otherwise inc(p) { skip unknown char anyway }
     end; { case }
getUTF8Char := u
end;
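The followup-byte test used in getUTF8Char, (byte and $C0) = $80, is also useful on its own. As a small sketch, assuming well-formed UTF-8, the following program counts characters rather than bytes by counting every byte that is not a followup byte. Str80 and UTF8Length are hypothetical names for this example and are not taken from qsys.pas.

```pascal
program CountChars;

type
  Str80 = string[80];  { hypothetical helper type for this sketch }

{ Counts the characters (code points) in a UTF-8 string by counting
  every byte that is not a followup byte (bit pattern 10xxxxxx).
  Assumes well-formed UTF-8; invalid input is not detected.
  UTF8Length is a made-up name, not part of GPC or AKFQuiz. }
function UTF8Length(const s: Str80): integer;
var
  i, n: integer;
begin
  n := 0;
  for i := 1 to length(s) do
    if (ord(s[i]) and $C0) <> $80 then
      inc(n);
  UTF8Length := n
end;

var
  s: Str80;

begin
  { 'gr', then U+00FC encoded as the bytes $C3 $BC, then 'n' }
  s := 'gr' + chr($C3) + chr($BC) + 'n';
  writeln('bytes: ', length(s));
  writeln('characters: ', UTF8Length(s))
end.
```

For this string the program reports 5 bytes but 4 characters.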
-- 
AKFoerster
