Re: Strings and standard Pascal

14 Mar 2006

Peter N Lewis wrote:
...
At 13:16 +0100 13/3/06, Frank Heckenbach wrote:
...
Peter N Lewis wrote:
...
There may not be any point in supporting Unicode any further.  From
I don't agree. It's not only Length (which is defined by the
(mis-)using them as UTF-8 bytes, but then you're on your own if
Length, SubStr/Copy, Index/Pos etc. behave strangely.)
Actually, with UTF-8, there are rarely any issues with Length, SubStr,
Copy, Index, or Pos.
With UTF-8:

Assuming valid UTF-8 strings, Pos will never mis-match.
Length returns the "size" of the string.  Given UTF-8, there must
be two different functions, one to return the size of the string in
chars - which you call "Length" is personal preference.
No, it isn't. It's clearly defined in the standards and all dialects
I'm aware of.
...

Searching for an ASCII character will always work as expected.
SubStr/Copy require valid indexes and length, but the result will
be explicitly either correct, or an invalid UTF-8 string.
For some definition of valid, but that's not the definition used in
the standards, or BP etc.
Your definition seems to be something like: Pos returns some value
that, when passed as the 2nd argument to SubStr/Copy yields a
substring starting with the string searched for in Pos.
But that's not the standard definition (more precisely, it's a
weaker condition that follows from the standard definition, but not
vice versa). The standard/BP definitions require that Index/Pos
return a character index, and that SubStr/Copy accept all (in-range)
character indexes and return valid strings.
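
The gap between the two definitions can be made concrete with a small
sketch. It is written in Python rather than Pascal only because Python's
bytes and str types keep the byte view and the character view apart; the
sample text and all variable names are made up for illustration.

```python
# Illustrative sample (an assumption, not from the thread): two non-ASCII
# characters precede the match, each taking two bytes in UTF-8.
text = "Gr\u00fc\u00dfe, Welt"      # "Grüße, Welt"
needle = "Welt"

raw = text.encode("utf-8")          # byte view, like 8-bit Chars holding UTF-8
raw_needle = needle.encode("utf-8")

byte_pos = raw.find(raw_needle) + 1   # 1-based, like Pos applied to bytes
char_pos = text.find(needle) + 1      # 1-based character index

# The weaker property holds: the byte index, fed back into a byte-wise
# Copy, yields a substring that starts with the search string.
weaker_property_holds = raw[byte_pos - 1:].startswith(raw_needle)

# The stronger property fails: the byte index is not the character index.
is_character_index = (byte_pos == char_pos)

print(byte_pos, char_pos, weaker_property_holds, is_character_index)
```

For this sample the byte-wise search reports 10 while the character
index is 8, so the result is usable for a byte-wise Copy but is not a
character index in the standard's sense.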
Just one other example of the standard requirements:
6.1.9 Character-strings
     character-string = `'' { string-element } `'' .
     string-element   = apostrophe-image | string-character .
     apostrophe-image = `''' .
     string-character = one-of-a-set-of-implementation-defined-characters .
In particular this means that the set of valid characters (and their
interpretation) is implementation-defined, but given this set, any
sequence is a valid string. This doesn't fit with UTF-8 (where not
every byte sequence is valid, of course).
...
For example, if you have a search string, a replace string, and a 
source string, the exact same code using Pos and Copy will work for 
ASCII and for UTF-8, assuming all the strings are valid ASCII or 
valid UTF-8 respectively.
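
A minimal sketch of that search-and-replace case, in Python for
illustration. The helper replace_all is hypothetical; it uses only find
(standing in for Pos) and slicing (standing in for Copy), and the same
code runs on character strings and on UTF-8 byte strings. It assumes
valid UTF-8 input and a non-empty search string.

```python
def replace_all(source, search, replacement):
    """Replace every occurrence of search, using only find and slicing.

    Works on str (characters) and on bytes (UTF-8 code units) alike.
    Hypothetical helper for illustration.
    """
    if len(search) == 0:
        raise ValueError("search string must not be empty")
    result = source[:0]              # empty value of the same type as source
    pos = 0
    while True:
        hit = source.find(search, pos)
        if hit < 0:
            return result + source[pos:]
        result += source[pos:hit] + replacement
        pos = hit + len(search)


text = "Gr\u00fc\u00dfe, Gr\u00fc\u00dfe"   # "Grüße, Grüße"
search = "\u00fc\u00df"                      # "üß"
replacement = "uess"

on_chars = replace_all(text, search, replacement)
on_bytes = replace_all(text.encode("utf-8"),
                       search.encode("utf-8"),
                       replacement.encode("utf-8"))

# Both views agree, and the byte-wise result is still valid UTF-8.
print(on_chars == on_bytes.decode("utf-8"))
```

This works because a valid UTF-8 search string can only match at
character boundaries of a valid UTF-8 source string.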
Well, that's one example where it works (i.e., only requires the
weaker axioms). Other examples don't work. E.g., you might iterate
over the characters of a string (for i := 1 to Length (s) do
DoSomethingWith (s[i])), and depending on what DoSomethingWith does,
this may or may not work. Or you could implement a scrolling routine
that outputs the substring Copy (s, i, Width) where i in-/decreases
by 1. With UTF-8 byte strings, this would often produce invalid
output. Or deleting a single character from a string (say, in an
editor when the user presses delete or backspace) will generally
leave an invalid UTF-8 string if done on bytes instead of
characters.
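
The scrolling and deletion cases can be sketched as follows, again in
Python for illustration. The sample word, the window width and the
helper is_valid_utf8 are assumptions chosen to keep the example small.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "na\u00efve"                 # "naïve": 5 characters
raw = word.encode("utf-8")          # 6 bytes: n a C3 AF v e

# Scrolling: the byte-wise equivalent of Copy (s, i, Width), with the
# start index moving one byte at a time.
width = 3
windows = [raw[i:i + width] for i in range(len(raw) - width + 1)]
window_validity = [is_valid_utf8(w) for w in windows]

# Deletion: removing one byte in the middle of the two-byte character
# leaves a dangling lead byte.
after_byte_delete = raw[:3] + raw[4:]

# Removing one character in the character view is always well-formed.
after_char_delete = word[:2] + word[3:]

print(window_validity)
print(is_valid_utf8(after_byte_delete), after_char_delete)
```

Two of the four windows split the two-byte character and are not valid
UTF-8, and the byte-wise deletion leaves an invalid string, while the
character-wise deletion simply yields "nave".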
As I said, you can, of course, (mis)use the 8-bit Char type as UTF-8
bytes, if you're always aware that you don't get full Pascal
semantics (so you have to work around the differences, e.g.,
checking UTF-8 sequences before deleting a character or scrolling).
Might be fine for you, not for other users (including myself, most
of the time).
Of course, Pascal doesn't require Unicode (or in fact any character
set of more than 22 characters ;-), but if we're talking about
supporting standard (or BP etc. compatible) Pascal *and* Unicode, we
have to recognize that we currently don't.
...
Handling case insensitively is more entertaining of course, but then 
it's already rarely handled well even with just ISO-8859-1.
Isn't it? Works fine for me (provided my locale is set up
correctly).
...
Anyway, if someone thinks Unicode32 is worth implementing in the RTS,
go for it, I'd just suggest that it's becoming less and less relevant.
I disagree. Even forgetting about Pascal semantics, I'd rather work
with fixed-width representations internally, unless data size is a
*real* concern, as the processing usually gets simpler.
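
As a rough illustration of why fixed width simplifies processing,
compare fetching the i-th character from a UTF-32 buffer and from a
UTF-8 buffer. This is a Python sketch under stated assumptions; the
helper names and the sample text are hypothetical, and both helpers
assume well-formed input.

```python
def char_at_utf32(buf: bytes, i: int) -> str:
    """Fixed width: character i occupies bytes 4*i .. 4*i+3."""
    return buf[4 * i:4 * i + 4].decode("utf-32-le")


def char_at_utf8(buf: bytes, i: int) -> str:
    """Variable width: scan for character starts before slicing."""
    # A byte of the form 10xxxxxx is a continuation byte; every other
    # byte starts a new character.
    starts = [k for k, b in enumerate(buf) if (b & 0xC0) != 0x80]
    begin = starts[i]
    end = starts[i + 1] if i + 1 < len(starts) else len(buf)
    return buf[begin:end].decode("utf-8")


# Characters that take 1, 2, 3 and 4 bytes in UTF-8.
text = "a\u00e9\u20ac\U0001F600"
utf32 = text.encode("utf-32-le")    # little-endian, no byte order mark
utf8 = text.encode("utf-8")

same = all(char_at_utf32(utf32, i) == char_at_utf8(utf8, i) == text[i]
           for i in range(len(text)))
print(len(utf8), len(utf32), same)
```

The UTF-32 lookup is a constant-time slice, while the UTF-8 lookup has
to scan the buffer; the price is 16 bytes instead of 10 for this sample.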
Frank
-- 
Frank Heckenbach, frank@g-n-u.de, http://fjf.gnu.de/, 7977168E
GPC To-Do list, latest features, fixed bugs:
http://www.gnu-pascal.de/todo.html
GPC download signing key: ACB3 79B2 7EB2 B7A7 EFDE  D101 CD02 4C9D 0FE0 E5E8
