On Saturday, 17 Feb 2007, Frank Heckenbach wrote:
I don't have a problem with conversion routines. The problem is that UTF-8 strings don't satisfy the requirements of Pascal "Char" and "String" types (e.g., that every sequence of valid "Char" values is a valid string value; as I said, see previous discussions). Since GPC aims to support Pascal (plus extensions), not some language-more-or-less-similar-to-Pascal,
Okay, you might say it is a misuse to use a "string" for UTF-8. But if the text file is in UTF-8 and the display interprets UTF-8, why should I not just make use of that? Why should I make things more complicated than they are? Just to obey a standard?
Section 4.1 of the GNU coding standards says:
| The GNU Project regards standards published by other organizations as
| suggestions, not orders. We consider those standards, but we do not obey
| them. In developing a GNU program, you should implement an outside
| standard's specifications when that makes the GNU system better overall
| in an objective sense. When it doesn't, you shouldn't.
I think we'll need a "Char" type larger than 8 bits for proper Unicode support. (And, BTW, processing on such a type is still easier that UTF-8. The main advantage of UTF-8 is reduces space requirement, so converting on I/O should also be the least work for the programmer. When implemented this way properly, most Pascal programs using strings should work without changes in Unicode, unless they specifically refer to ASCII, or ISO-8859-x, etc. properties.)
Yes, I just showed you part of my code. I also use a special type for Unicode characters:
type
  Unicode = Cardinal;  { large integer value - at least 3 bytes }

  UnicodeString = record
    Length: Integer;
    content: array [1 .. MaxCharsPerLine] of Unicode
  end;
I fill this string with the UTF-8 decoder I showed you in my last mail.
Because this UnicodeString takes a lot of space, I only keep one line in memory at a time.
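Just as an illustration (this is not the decoder from my last mail, only a minimal sketch of the idea): it assumes reasonably well-formed UTF-8, does no range checking, and replaces stray continuation bytes with '?'.

{ Sketch only: fills a UnicodeString from a UTF-8 encoded line. }
procedure DecodeUTF8Line (const s: String; var u: UnicodeString);
var
  i, b, extra: Integer;
  cp: Unicode;
begin
  u.Length := 0;
  i := 1;
  while (i <= Length (s)) and (u.Length < MaxCharsPerLine) do
  begin
    b := Ord (s[i]);
    Inc (i);
    if b <= $7F then
      begin cp := b; extra := 0 end            { ASCII byte }
    else if (b and $E0) = $C0 then
      begin cp := b and $1F; extra := 1 end    { start of a 2-byte sequence }
    else if (b and $F0) = $E0 then
      begin cp := b and $0F; extra := 2 end    { start of a 3-byte sequence }
    else if (b and $F8) = $F0 then
      begin cp := b and $07; extra := 3 end    { start of a 4-byte sequence }
    else
      begin cp := Ord ('?'); extra := 0 end;   { stray continuation byte }
    { merge in the continuation bytes (6 payload bits each) }
    while (extra > 0) and (i <= Length (s)) do
    begin
      cp := (cp shl 6) or (Ord (s[i]) and $3F);
      Inc (i);
      Dec (extra)
    end;
    Inc (u.Length);
    u.content[u.Length] := cp
  end
end;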
In the meantime, you can try to make things work with UTF-8. Of course, not everything will work, e.g. "Length" on a UTF-8 string will produce wrong results, and I suppose you know that and you already use workarounds in your code.
Yes.
function UTF8Length (const s: String): LongInt;
var
  i, res: LongInt;
begin
  res := 0;
  { count ASCII bytes and start bytes, ignore the rest }
  for i := 1 to Length (s) do
    if (Ord (s[i]) <= 127) or (Ord (s[i]) >= $C0) then
      Inc (res);
  UTF8Length := res
end;
This counts how many characters are in the string. Whether these characters actually take up space on the screen, or whether they are combining characters, is a different question.
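For example (a made-up snippet, assuming the source file and the terminal both use UTF-8, so 'ä' is stored as two bytes), Length counts bytes while UTF8Length counts characters:

s := 'Bär';                { 'ä' = two bytes in UTF-8 }
WriteLn (Length (s));      { 4 - bytes }
WriteLn (UTF8Length (s))   { 3 - characters }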
As for CRT, if you know what's necessary to work with such UTF-8 pseudo-strings, just make the required changes (it's free software ;-).
Okay, I must admit that I have trouble reading the code. It's a mix of two languages, and the code is very sparsely commented... But I'm going to have a deeper look when I have the time. (At the moment I'm working on a new project which I haven't published yet.)
I didn't think plain ncurses supports multibyte chars (think of counting chars for cursor position etc.), but if you know better, just do it (or describe what needs to be changed) ...
Well, I'm not sure. Is it actually counting chars to get the cursor position? If yes, then it's really a problem.