On Saturday, 17 Feb 2007, Frank Heckenbach wrote:
I don't have a problem with conversion routines. The problem is that UTF-8 strings don't satisfy the requirements of Pascal "Char" and "String" types (e.g., that every sequence of valid "Char" values is a valid string value; as I said, see previous discussions). Since GPC aims to support Pascal (plus extensions), not some language-more-or-less-similar-to-Pascal,
Okay, you might say it is a misuse to use a "string" for UTF-8. But if the text file is in UTF-8 and the display interprets UTF-8, why should I not just make use of that? Why should I make things more complicated than they are? Just to obey a standard?
Section 4.1 of the GNU coding standards says:
| The GNU Project regards standards published by other organizations as
| suggestions, not orders. We consider those standards, but we do not obey
| them. In developing a GNU program, you should implement an outside
| standard's specifications when that makes the GNU system better overall
| in an objective sense. When it doesn't, you shouldn't.
I think we'll need a "Char" type larger than 8 bits for proper Unicode support. (And, BTW, processing on such a type is still easier that UTF-8. The main advantage of UTF-8 is reduces space requirement, so converting on I/O should also be the least work for the programmer. When implemented this way properly, most Pascal programs using strings should work without changes in Unicode, unless they specifically refer to ASCII, or ISO-8859-x, etc. properties.)
Yes, I just showed you part of my code. I also use a special type for Unicode characters:
type
  Unicode = Cardinal;  { large integer value - at least 3 bytes }

  UnicodeString = record
    Length: Integer;
    content: array [1 .. MaxCharsPerLine] of Unicode
  end;
I fill this string with the UTF-8 decoder I showed you in my last mail.
Because this UnicodeString takes a lot of space, I only keep one line in memory at a time.
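Just as an illustration (this is not the decoder from my last mail, only a minimal sketch of the idea): it assumes reasonably well-formed UTF-8, does no range checking, and replaces stray continuation bytes with '?'.

{ Sketch only: fills a UnicodeString from a UTF-8 encoded line. }
procedure DecodeUTF8Line (const s: String; var u: UnicodeString);
var
  i, b, extra: Integer;
  cp: Unicode;
begin
  u.Length := 0;
  i := 1;
  while (i <= Length (s)) and (u.Length < MaxCharsPerLine) do
  begin
    b := Ord (s[i]);
    Inc (i);
    if b <= $7F then
      begin cp := b; extra := 0 end            { ASCII byte }
    else if (b and $E0) = $C0 then
      begin cp := b and $1F; extra := 1 end    { start of a 2-byte sequence }
    else if (b and $F0) = $E0 then
      begin cp := b and $0F; extra := 2 end    { start of a 3-byte sequence }
    else if (b and $F8) = $F0 then
      begin cp := b and $07; extra := 3 end    { start of a 4-byte sequence }
    else
      begin cp := Ord ('?'); extra := 0 end;   { stray continuation byte }
    { merge in the continuation bytes (6 payload bits each) }
    while (extra > 0) and (i <= Length (s)) do
    begin
      cp := (cp shl 6) or (Ord (s[i]) and $3F);
      Inc (i);
      Dec (extra)
    end;
    Inc (u.Length);
    u.content[u.Length] := cp
  end
end;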
In the meantime, you can try to make things work with UTF-8. Of course, not everything will work, e.g. "Length" on a UTF-8 string will produce wrong results, and I suppose you know that and you already use workarounds in your code.
Yes.
function UTF8Length (const s: String): LongInt;
var
  i, res: LongInt;
begin
  res := 0;
  { count ASCII bytes and start bytes, ignore the rest }
  for i := 1 to Length (s) do
    if (Ord (s[i]) <= 127) or (Ord (s[i]) >= $C0) then
      Inc (res);
  UTF8Length := res
end;
This counts how many characters are in the string. Whether these characters actually take up space on the screen, or whether they are combining characters, is a different question.
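For example (a made-up snippet, assuming the source file and the terminal both use UTF-8, so 'ä' is stored as two bytes), Length counts bytes while UTF8Length counts characters:

s := 'Bär';                { 'ä' = two bytes in UTF-8 }
WriteLn (Length (s));      { 4 - bytes }
WriteLn (UTF8Length (s))   { 3 - characters }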
As for CRT, if you know what's necessary to work with such UTF-8 pseudo-strings, just make the required changes (it's free software ;-).
Okay, I must admit that I have trouble reading the code. It's a mix of two languages, and the code is very sparsely commented... But I'm going to have a deeper look when I have the time. (At the moment I'm working on a new project which I haven't published yet.)
I didn't think plain ncurses supports multibyte chars (think of counting chars for cursor position etc.), but if you know better, just do it (or describe what needs to be changed) ...
Well, I'm not sure. Is it actually counting chars to get the cursor position? If yes, then it's really a problem.