On Mon, Aug 31, 2020 at 08:06:45PM -0700, scott andrew franco wrote:
> Waldek,
> Sure, 48 bits vs 64 bits. Why didn't it truncate the mantissa on conversion to float, i.e., b := maxint? I would have expected something like zeros on the right side.
On x86_64 gpc uses the SSE unit for floating point, so real has the 64-bit IEEE format with 53 significant bits. IEEE says that operations should round the result, and that is what happens: the closest representable number to maxint is 2^63, that is, maxint + 1.
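To see the rounding concretely, a minimal sketch (the program name is my own; maxint is assumed to be 2^63 - 1 on 64-bit gpc, as above):

  program RoundDemo;
  var
    b: real;
  begin
    { maxint = 2^63 - 1 needs 63 significant bits; a 53-bit
      mantissa cannot hold it, so round-to-nearest gives
      2^63 = maxint + 1, not a truncated value }
    b := maxint;
    writeln(b);
  end.

Truncation would indeed zero the low bits, but the IEEE default is round-to-nearest, and maxint = 2^63 - 1 is at distance 1 from 2^63 but at distance 1023 from the largest double below it (2^63 - 1024), so it rounds up.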
Concerning testing floating point, the standard requires almost nothing, so it boils down to quality of implementation, and I would argue that the 64-bit IEEE format plus optimizations in gpc give a high quality implementation. But IEEE rules (like other floating point rules) may produce results which do not agree with naive intuition. In some cases optimizations lead to results which are slightly different than literally performing operations according to IEEE rules -- I consider this normal (otherwise it would be almost impossible to optimize floating point).
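As an illustration (my own example, not anything specific to gpc): floating point addition is not associative, so an optimizer that reassociates a sum can legitimately change the result:

  program AssocDemo;
  var
    a, b, c: real;
  begin
    a := 1.0e16; b := -1.0e16; c := 1.0;
    { literal IEEE evaluation gives (a + b) + c = 1.0,
      but a + (b + c) = 0.0, because b + c rounds back
      to -1.0e16 (the spacing of doubles there is 2.0) }
    writeln((a + b) + c);
    writeln(a + (b + c));
  end.

Either answer is within the normal error model; what breaks down is the expectation of bit-exact agreement with a literal reading of the source.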
Back to testing: in the Pascal spirit you should test that the relative error of operations does not exceed an assumed maximal error. Similarly for range. The standard does not give you _any_ constraints on range or accuracy. IIUC an implementation having a single floating point number (that is, 0) is legal (but useless). I would say that 20 significant bits and 8 exponent bits is probably a reasonable lowest limit of accuracy.
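A sketch of such a test (MaxRelErr is an assumed bound you would pick for your own purposes, not anything the standard mandates):

  program EpsTest;
  const
    MaxRelErr = 1.0e-12;  { assumed bound; adjust to the accuracy you require }
  var
    expected, computed, relerr: real;
  begin
    expected := 2.0;
    computed := sqrt(2.0) * sqrt(2.0);  { equals 2.0 up to rounding }
    relerr := abs(computed - expected) / abs(expected);
    if relerr <= MaxRelErr then
      writeln('PASS')
    else
      writeln('FAIL: relative error = ', relerr);
  end.

With 53 significant bits the relative error of a single rounded operation is at most 2^-53, so a bound like 1.0e-12 leaves generous headroom; for a hypothetical 20-significant-bit implementation you would loosen it to around 2^-20, about 1.0e-6.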