Fwd: Short Strings - Gpc

6 Jul 2006


      Forwarded from: marcov@stack.nl (Marco van de Voort)
In gmane.comp.compilers.gpc, you wrote:
...
...
FPC is a lot like Delphi, but since it supports the TP libs (Delphi doesn't),
there is some use, also in the textmode IDE. The compiler and utils
themselves also originally used shortstrings. However, we have been moving
away from that for some time now (since 2000 already).
Well, I used BP myself long time ago, .........
was nothing much to be changed (except for sometimes declaring a
bigger maximum size :-).
I had similar experiences with my own codebases, because they were already
abstracted for string usage since they were ported back from Modula2 before.
However, in my experience only a small fraction of existing codebases
is that clean. See e.g. the average level of filth in SWAG.
...
...
In time, the core compiler itself (symbol tables etc) will be the only
non-legacy code using shortstrings. This is because the size limit is no
problem for identifiers/tokens etc, and it is quite a bit faster.
This could be because the static nature allocates them together with the
objects, while dynamic strings would be an allocation extra,
I suppose you're talking about Delphi's string variant which is only
dynamic AFAIR.
Correct, and specifically inside the FPC compiler core. The FPC compiler is
speedwise mostly dependent on the memory manager and lexing.
In plain Delphi (and FPC in the Delphi mode), "String" means ansistring, while in
TP (and FPC in TP modes) the identifier "String" means shortstring.
This makes switching easy, provided the shortstring code is cleaned up (no
writes beyond the range 1..length(s) and no s[0] use).
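To illustrate what "cleaned up" means here, a minimal sketch (assuming Free
Pascal; the program and the procedure name StripTrailingSpaces are made up for
this example). It touches the string only through Length, SetLength and
indices in 1..Length(s), so it should compile the same way whether "string"
means shortstring or ansistring:

```pascal
program CleanShortStr;
{$mode objfpc}
{$H-}  { "string" means shortstring here; use $H+ to get ansistring instead }

{ Hypothetical helper: removes trailing blanks without touching s[0]
  and without writing outside 1..Length(s). }
procedure StripTrailingSpaces(var s: string);
var
  n: integer;
begin
  n := Length(s);
  while (n > 0) and (s[n] = ' ') do
    Dec(n);
  SetLength(s, n);   { instead of s[0] := Chr(n) }
end;

var
  t: string;
begin
  t := 'token   ';
  StripTrailingSpaces(t);
  WriteLn('[', t, ']');
end.
```

The same routine written with s[0] := Chr(n) would only be valid for
shortstrings, which is exactly the kind of line that has to be hunted down
when switching.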
...
However, Extended Pascal's strings can be static or (explicitly) dynamic,
just like other types, so you can use them mostly the same way as short
strings, and there is no extra overhead, except for a few bytes more to
store the bigger length field and the capacity.
If the static variant is instantiated with the class/object, then yes, it
would.
But as said, there is no real reason to clean shortstring out of the
compiler core. Maybe except if we switch to unicode (or UTF8) mangled names
and identifiers, that will be a major rewrite anyway.
...
I think for some cases, limited strings are still useful, for other
cases, unlimited strings are preferable.
Correct. However, EP now has this static type. And FPC also has one,
reasonably transparent. The point is more that it is not worth investing in
a second one unless there are very good reasons (and the current Mac Pascal
discussion is one for GPC IMHO).
...
...
Correct. And it is not just the interfaces, but also demos, documentation
and nearly all Pascal code available on the Mac.
As I said WRT BP above, I think such code, as long as it doesn't use
external short-string interfaces, can probably convert to (EP) long
strings with minimal effort.
It is a barrier. And to a programmer who comes from an all-hands-held commercial
compiler, it is yet another one on top of the already existing ones (the change of IDE,
project system, misc units and language changes). That's what Peter tries
to point out.
For a new developer everything is new, and _every_ change that forces him to
inspect every line in the whole source is a major pain and a risk. Even if
every change by itself is negligible.
I learned this the hard way during the FPC/Delphi compatibility convergence. You
quickly port and fix heaps of all those little issues, and then when it
finally compiles it turns out the program doesn't work anymore. The smaller
the changes, the smaller the chance of such a scenario.
However you and me and probably the rest on this list can handle decent
debuggers and if necessary probably even GDB down to assembler level if we
have to. Most of the Apple newbies can't though.
...
...
In practice this means auto conversion by assignment and passing to value
parameters. FPC/Delphi do this already.
We do that for CStrings, so most of the time we can avoid using
CStrings in Pascal code this way. An automatic back-conversion
(e.g., for function results) also seems possible, though we don't do
that yet. Real reference parameters would be a problem, but they
are rarely needed (I suppose, also WRT short strings in Mac OS).
Similarly with FPC/Delphi and ansistrings <-> pchars. Btw, are function
results handled by the conversion?
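For reference, a small sketch of the conversions meant above as they behave
in Free Pascal (the program and the procedure name Show are invented for the
example). Assignment and value/const parameters convert automatically; only
the ansistring -> pchar direction needs an explicit cast:

```pascal
program ConvSketch;
{$mode objfpc}{$H+}

{ Hypothetical routine taking an ansistring by value semantics. }
procedure Show(const a: ansistring);
begin
  WriteLn(a, ' (', Length(a), ')');
end;

var
  ss: shortstring;
  a, b: ansistring;
  p: PChar;
begin
  ss := 'short';
  Show(ss);        { shortstring -> ansistring: implicit, copies }

  a := 'hello';
  p := PChar(a);   { ansistring -> pchar: explicit cast, no copy;
                     p is only valid while a is alive and unchanged }
  b := p;          { pchar -> ansistring: implicit, copies up to the #0 }
  Show(b);

  ss := b;         { ansistring -> shortstring: implicit, cut at 255 chars }
  Show(ss);
end.
```

Note that none of this helps for var parameters: a routine declared with
"var s: shortstring" still needs an actual shortstring variable.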
...
So this might indeed be an option for short strings (i.e., no direct
support, but automatic conversions). It will certainly be easier to
implement, but it lacks some features (such as binary
compatibility), so I might go for the full support anyway, if it
doesn't present insurmountable problems ...
I meant these conversions as additional to the BP featureset, to
interoperate with your own string type. And afaik they already have some
workarounds like that using the records and some operator overloading? (Peter?)
I don't think you can do without full support. Especially the "open array"
string compatibility is too important to leave out (more important than the
ref thing), because this is the main way to have generic shortstring
routines.
...
...
The shortstring implementation is simply something
like
procedure setlength(var s: shortstring; x: integer);
begin
  if x > 255 then x := 255;
  s[0] := chr(x);
end;
This only works for strings of capacity 255.
Correct, my shortstringese is a bit rusty. Substitute high(s) for 255 to get
a general TP-compatible solution.
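Spelled out, the high(s) variant could look roughly like the sketch below. It
assumes Free Pascal's TP-style "openstring" parameter type, for which high(s)
reports the declared capacity of the actual argument; the name SetShortLength
is made up to avoid clashing with the built-in SetLength, and the lower-bound
check is an addition not in the original snippet:

```pascal
program SetLenSketch;
{$H-}  { "string" means shortstring here }

{ Hypothetical name; clamps the new length to the capacity of the
  variable that is actually passed in, not to a fixed 255. }
procedure SetShortLength(var s: openstring; x: integer);
begin
  if x < 0 then x := 0;                 { added guard, not in the original }
  if x > High(s) then x := High(s);     { High(s) = declared capacity }
  s[0] := Chr(x);                       { length byte of a shortstring }
end;

var
  small: string[10];
begin
  small := 'abc';
  SetShortLength(small, 200);   { clamped to the capacity of small }
  WriteLn(Length(small));
end.
```

As with the original, growing the string this way leaves the newly exposed
characters undefined; only the length byte is updated.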
...
...
This is not a problem in general. A hybrid system always has penalties, and
people _choose_ to use it. Mostly subsystems are internally one string type,
and only the interfaces between the subsystems aren't.
I hope so. OTOH, I've seen in the past a lot of BP programmers use
CStrings (i.e., "PChar") throughout in their Pascal code, after they
were added to BP (version 6 or 7), probably also because of
Borland's marketing them as the next big thing.
I rather think that it was the only way to get around the size limits. Also,
the TP/BP Windows compat was more direct on the API than the VCL-based Delphi.
But there certainly was a bit of a myth in TP times that pchars were faster,
partially also because they were from "C".
There was some truth in that (shortstrings copy too much), and carefully
crafted code could be better, but the speed advantage was vastly exaggerated,
especially compared with the fact that it was way easier to do anything wrong
with pchars, and complex string code was way more work.
The problems were that the emerging 32-bit C compilers BP was compared to
were simply faster because they optimized and were 32-bit (using a 32-bit
move() routine for their copies), and BP was a codegenerator from
yesteryear.
...
(I admit I almost fell for it myself, but when I saw the drawbacks, I
converted back to short strings what I had changed to CStrings already,
fortunately, so I could easily convert the code to EP strings with GPC
later.)
I mostly used shortstrings in that time and pchars only to break that limit
if needed. However when that really got important, because the path sizes
exploded, I was already using FPC and ansistrings were stable.
...
...
Copying conversions are required for literals anyway, but they can't be
avoided, since shortstrings are not lazily assigned.
I'm not sure if that's required. We could probably also emit the
literals as short strings when needed (like we do with literals used
as CStrings). We'll see which will be easier in the end.
I meant the following situation:
procedure x;
var s : string;
begin
  s:='bolalalallala';
end;
On the assignment some copy must follow from where the const is stored to
the stack.