Peter N Lewis wrote:
At 13:16 +0100 13/3/06, Frank Heckenbach wrote:
Peter N Lewis wrote:
There may not be any point in supporting Unicode any further. From
I don't agree. It's not only Length (which is defined by the (mis-)using them as UTF-8 bytes, but then you're on your own if Length, SubStr/Copy, Index/Pos etc. behave strangely.)
Actually, with UTF-8, there are rarely any issues with Length, SubStr, Copy, Index, or Pos.
With UTF-8:
- Assuming valid UTF-8 strings, Pos will never mis-match.
- Length returns the "size" of the string. Given UTF-8, there must
be two different functions, one to return the size of the string in bytes and one to return its length in chars - which one you call "Length" is personal preference.
No, it isn't. It's clearly defined in the standards and all dialects I'm aware of.
- Searching for an ASCII character will always work as expected.
- SubStr/Copy require valid indexes and length, but the result will
be explicitly either correct, or an invalid UTF-8 string.
For some definition of valid, but that's not the definition used in the standards, or BP etc.
Your definition seems to be something like: Pos returns some value that, when passed as the 2nd argument to SubStr/Copy yields a substring starting with the string searched for in Pos.
But that's not the standard definition (more precisely, it's a weaker condition that follows from the standard definition, but not vice versa). The standard/BP definitions require that Index/Pos return a character index, and that SubStr/Copy accept all (in-range) character indexes and return valid strings.
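A minimal sketch of the two conditions, not GPC code: Python bytes stand in for a Pascal string of 8-bit Chars holding UTF-8, and a Python str stands in for a string of real characters. The helpers pos, copy and is_valid_utf8 are hypothetical names that only mimic the 1-based Pascal routines.

```python
def pos(sub, s):
    """1-based index of sub in s, or 0 if absent (mimics Pascal's Pos)."""
    return s.find(sub) + 1

def copy(s, index, count):
    """Mimics Pascal's Copy (s, index, count) with a 1-based index."""
    return s[index - 1:index - 1 + count]

def is_valid_utf8(data):
    """True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

s_chars = "na\u00efve caf\u00e9"      # string of characters
s_bytes = s_chars.encode("utf-8")     # same text as UTF-8 bytes
needle_chars = "caf\u00e9"
needle_bytes = needle_chars.encode("utf-8")

char_pos = pos(needle_chars, s_chars)  # a character index
byte_pos = pos(needle_bytes, s_bytes)  # a byte index, not a character index

# Weaker condition: feeding Pos's result back into Copy finds the needle.
found = copy(s_bytes, byte_pos, len(needle_bytes))

# Stronger condition fails on bytes: index 3 is in range, but Copy
# returns only the first byte of a two-byte sequence.
fragment = copy(s_bytes, 3, 1)
```

Here the byte-based Pos/Copy pair still round-trips, but byte_pos differs from char_pos and fragment is not a valid string, whereas copy (s_chars, 3, 1) yields a whole character.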
Just one other example of the standard requirements:
6.1.9 Character-strings
character-string = `'' { string-element } `'' .
string-element = apostrophe-image | string-character .
apostrophe-image = `''' .
string-character = one-of-a-set-of-implementation-defined-characters .
In particular this means that the set of valid characters (and their interpretation) is implementation-defined, but given this set, any sequence is a valid string. This doesn't fit with UTF-8 (where not every byte sequence is valid, of course).
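To make the mismatch concrete, here is a small sketch (Python, used only as an illustration; ISO-8859-1 is picked as an example of an 8-bit implementation-defined character set) that enumerates every two-byte sequence and checks how many are valid strings under each interpretation. The name is_valid_utf8 is a hypothetical helper.

```python
from itertools import product

def is_valid_utf8(data):
    """True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Every possible sequence of two 8-bit values.
all_pairs = [bytes(p) for p in product(range(256), repeat=2)]

# Read as ISO-8859-1, each sequence is a string of exactly two characters.
valid_latin1 = sum(1 for p in all_pairs if len(p.decode("latin-1")) == 2)

# Read as UTF-8, only some of the sequences are valid strings at all.
valid_utf8 = sum(1 for p in all_pairs if is_valid_utf8(p))
```

With a single-byte character set, valid_latin1 equals len (all_pairs), which is what the grammar above asks for; valid_utf8 is considerably smaller.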
For example, if you have a search string, a replace string, and a source string, the exact same code using Pos and Copy will work for ASCII and for UTF-8, assuming all the strings are valid ASCII or valid UTF-8 respectively.
Well, that's one example where it works (i.e., only requires the weaker axioms). Other examples don't work. E.g., you might iterate over the characters of a string (for i := 1 to Length (s) do DoSomethingWith (s[i])), and depending on what DoSomethingWith does, this may or may not work. Or you could implement a scrolling routine that outputs the substring Copy (s, i, Width) where i in-/decreases by 1. With UTF-8 byte strings, this would often produce invalid output. Or deleting a single character from a string (say, in an editor when the user presses delete or backspace) will generally leave an invalid UTF-8 string if done on bytes instead of characters.
As I said, you can, of course, (mis)use the 8-bit Char type as UTF-8 bytes, if you're always aware that you don't get full Pascal semantics (so you have to work around the differences, e.g., checking UTF-8 sequences before deleting a character or scrolling). Might be fine for you, not for other users (including myself, most of the time).
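The backspace case and its workaround might look roughly like this. It is a sketch in Python with bytes standing in for 8-bit Pascal Chars; backspace_bytes and backspace_utf8 are hypothetical names, and the second one assumes its input is already valid UTF-8.

```python
def is_valid_utf8(data):
    """True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def backspace_bytes(s):
    """Naive version: drop the last byte, as plain Char semantics would."""
    return s[:-1]

def backspace_utf8(s):
    """Drop the last whole character of a valid UTF-8 byte string."""
    if not s:
        return s
    end = len(s) - 1
    # Continuation bytes have the bit pattern 10xxxxxx; step back over
    # them until the first byte of the sequence is reached.
    while end > 0 and (s[end] & 0xC0) == 0x80:
        end -= 1
    return s[:end]

text = "caf\u00e9".encode("utf-8")   # the last character takes two bytes
naive = backspace_bytes(text)        # ends in a dangling lead byte
careful = backspace_utf8(text)       # whole character removed
```

The extra loop is the kind of check one has to add by hand everywhere a character is deleted or a window is scrolled when Char really means "UTF-8 byte".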
Of course, Pascal doesn't require Unicode (or in fact any character set of more than 22 characters ;-), but if we're talking about supporting standard (or BP etc. compatible) Pascal *and* Unicode, we have to recognize that we currently don't.
Handling case insensitively is more entertaining of course, but then it's already rarely handled well even with just ISO-8859-1.
Isn't it? Works fine for me (provided my locale is set up correctly).
Anyway, if someone thinks Unicode32 is worth implementing in the RTS, go for it, I'd just suggest that it's becoming less and less relevant.
I disagree. Even forgetting about Pascal semantics, I'd rather work with fixed-width representations internally, unless data size is a *real* concern, as the processing usually gets simpler.
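As a rough illustration of why fixed width tends to simplify processing (a Python sketch, assuming UTF-32 little-endian as the fixed-width form; nth_char_utf32 and nth_char_utf8 are hypothetical helpers, not RTS routines): fetching the n-th character is plain arithmetic in one case and a scan in the other.

```python
def nth_char_utf32(buf, n):
    """n-th character (1-based) of a UTF-32-LE buffer: pure arithmetic."""
    if n < 1 or 4 * n > len(buf):
        raise IndexError(n)
    return buf[4 * (n - 1):4 * n].decode("utf-32-le")

def nth_char_utf8(buf, n):
    """n-th character (1-based) of a valid UTF-8 buffer: needs a scan."""
    seen = 0
    start = None
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:       # not a continuation byte: new character
            seen += 1
            if seen == n:
                start = i
            elif seen == n + 1:
                return buf[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return buf[start:].decode("utf-8")

text = "na\u00efve"
as_utf32 = text.encode("utf-32-le")  # 4 bytes per character
as_utf8 = text.encode("utf-8")       # 1 or 2 bytes per character here
```

Both helpers return the same character for the same n; the price of the simpler one is the larger buffer, which is the data-size trade-off mentioned above.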
Frank