On Mon, Aug 31, 2020 at 08:06:45PM -0700, scott andrew franco wrote:
> Waldek,
> Sure, 48 bits vs 64 bits. Why didn't it truncate the mantissa on conversion to float, i.e., b := maxint? I would have expected something like zeros on the right side.
On x86_64 gpc uses the SSE unit for floating point, so real has the 64-bit IEEE format with 53 significant bits. IEEE says that operations should round the result, and that is what happens: the closest representable number to maxint is 2^63, that is, maxint + 1.
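To see the rounding concretely, a minimal sketch (the program name is my own; maxint is assumed to be 2^63 - 1 on 64-bit gpc, as above):

  program RoundDemo;
  var
    b: real;
  begin
    { maxint = 2^63 - 1 needs 63 significant bits; a 53-bit
      mantissa cannot hold it, so round-to-nearest gives
      2^63 = maxint + 1, not a truncated value }
    b := maxint;
    writeln(b);
  end.

Truncation would indeed zero the low bits, but the IEEE default is round-to-nearest, and maxint = 2^63 - 1 is at distance 1 from 2^63 but at distance 1023 from the largest double below it (2^63 - 1024), so it rounds up.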
Concerning testing floating point, the standard requires almost nothing, so it boils down to quality of implementation, and I would argue that the 64-bit IEEE format plus optimizations in gpc give a high quality implementation. But IEEE rules (like other floating point rules) may produce results which do not agree with naive intuition. In some cases optimizations lead to results which are slightly different than literally performing operations according to IEEE rules -- I consider this normal (otherwise it would be almost impossible to optimize floating point).
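As an illustration (my own example, not anything specific to gpc): floating point addition is not associative, so an optimizer that reassociates a sum can legitimately change the result:

  program AssocDemo;
  var
    a, b, c: real;
  begin
    a := 1.0e16; b := -1.0e16; c := 1.0;
    { literal IEEE evaluation gives (a + b) + c = 1.0,
      but a + (b + c) = 0.0, because b + c rounds back
      to -1.0e16 (the spacing of doubles there is 2.0) }
    writeln((a + b) + c);
    writeln(a + (b + c));
  end.

Either answer is within the normal error model; what breaks down is the expectation of bit-exact agreement with a literal reading of the source.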
Back to testing: in the Pascal spirit you should test that the relative error of operations does not exceed an assumed maximal error. Similarly for range. The standard does not give you _any_ constraints on range or accuracy. IIUC an implementation having a single floating point number (that is, 0) is legal (but useless). I would say that 20 significant bits and 8 exponent bits is probably a reasonable lowest limit of accuracy.
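A sketch of such a test (MaxRelErr is an assumed bound you would pick for your own purposes, not anything the standard mandates):

  program EpsTest;
  const
    MaxRelErr = 1.0e-12;  { assumed bound; adjust to the accuracy you require }
  var
    expected, computed, relerr: real;
  begin
    expected := 2.0;
    computed := sqrt(2.0) * sqrt(2.0);  { equals 2.0 up to rounding }
    relerr := abs(computed - expected) / abs(expected);
    if relerr <= MaxRelErr then
      writeln('PASS')
    else
      writeln('FAIL: relative error = ', relerr);
  end.

With 53 significant bits the relative error of a single rounded operation is at most 2^-53, so a bound like 1.0e-12 leaves generous headroom; for a hypothetical 20-significant-bit implementation you would loosen it to around 2^-20, about 1.0e-6.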