The BX floating point text format
The current SI is defined by published numerical values of physical constants in SI
units. These numbers are expressed, as usual in print, as decimal numbers in scientific
notation. They were (I believe) most likely produced on computing devices using
something like the IEEE754 floating point format. The latter representation is often a
slightly different number, due to the difference between the bases of the fractions (10
and 2). It is therefore not entirely clear which of the two (sometimes) distinct
numbers is actually being specified. These differences are currently smaller than any
actual experimental precision, but not by very much in some cases.
On closer inspection, the problem is not with the use of decimal digits in the
representation, but rather the base of the exponent. To back up a step, scientific
notation represents a rational number of a certain restricted sort. The number may be
(in principle) any integer, but if the number has a denominator greater than one, that
denominator must in fact be a power of 10. Computer floating point works exactly the
same way except that the denominator must be a power of 2. Each format therefore has
numbers it can't represent in any finite length mantissa, and the two sets do not
coincide. The problem is worse because even an integer may be cut down to a fixed
precision. The simplest thing to do, it seems to me, is to print the numbers in a
format using decimal digits with a binary exponent - bx. The bx basically substitutes
for the e in typical "%e" type notation. I have functions in C, Python, and Perl which
are suitable for plugging into printf/scanf type formats (though I don't know what
letter to use).
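The restriction on denominators can be checked directly with exact rational arithmetic. What follows is a minimal Python sketch using only the standard fractions module. The names printed, stored, and is_power_of_two are illustrative and are not part of the bx software.

```python
from fractions import Fraction

# The decimal text "0.1" is the rational 1/10: its denominator is a power of 10.
printed = Fraction("0.1")

# The double nearest to 0.1 is a different rational. Fraction(float) converts
# the stored value exactly, and its denominator is a power of 2.
stored = Fraction(0.1)

def is_power_of_two(n):
    """True if the positive integer n has exactly one bit set."""
    return n > 0 and n & (n - 1) == 0

print(printed.denominator)                  # 10
print(is_power_of_two(stored.denominator))  # True
print(printed == stored)                    # False
```

Neither form can hold the other's value in a finite mantissa, which is the mismatch described above.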
1bx1 means 1 times 2 to the 1, which is 2. 1bx5 is 32, as is 32bx0, but you could also
say 2bx4 etc. But the way that bx usually works is to show you how the number is
actually represented (in slightly mangled form) in the native floating point format.
For 64-bit IEEE754 double precision arithmetic, the integer 32 is 4503599627370496bx-47
- probably not what you expected. It's not intended to be human readable, though. It's
an exact representation. For a given data type the mantissa (the first number) will
always be roughly the same length, as you can see in the table below. Note especially
that Avogadro's number, which was given an exact integer value, is represented
inaccurately. The represented number differs by 12,976,128 atoms from the printed
value! Which of these two numbers is the "exact" value?
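The two examples above can be reproduced with a short Python sketch. It is not the bx library code: the function name to_bx_sketch is hypothetical, it handles only finite nonzero doubles, and it assumes the native float is an IEEE754 double with a 53-bit mantissa.

```python
import math
from fractions import Fraction

def to_bx_sketch(x):
    """Format a finite nonzero double as '<integer mantissa>bx<binary exponent>'."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("this sketch leaves out zeroes, inf, and nan")
    m, e = math.frexp(x)    # x == m * 2**e exactly, with 0.5 <= |m| < 1
    mant = int(m * 2**53)   # exact: m carries at most 53 significant bits
    return f"{mant}bx{e - 53}"

print(to_bx_sketch(32.0))           # 4503599627370496bx-47
print(to_bx_sketch(6.02214076e23))  # 8973689019680023bx26

# Printed decimal value minus the double actually stored, computed exactly.
print(Fraction("6.02214076e23") - Fraction(6.02214076e23))  # 12976128
```

The last line is the 12,976,128 discrepancy quoted for Avogadro's number.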
The table below shows the approximate differences between the two representations. It
was calculated using Intel 80-bit extended-precision floating point. Consider the
implication. The best clocks are approaching a precision where double precision floating
point will have to be abandoned soon. That's easy enough when an 80-bit type is
available. But when you switch, the value of many of the constants changes. This
doesn't happen if you use bx for publishing and input, because a bx value is exactly
represented when read into a larger data type - in fact it follows the same rules as
hardware floating point type conversion. It will never be necessary to redefine bx for
new data types.
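The reason a bx value survives a change of data type is that it names one specific rational, mantissa times a power of 2. Any binary format with enough mantissa bits and exponent range holds that rational unchanged. Python has no portable 80-bit type, so the sketch below uses Fraction as a stand-in for an arbitrarily wide format. The parser name from_bx_sketch is hypothetical, and it accepts only the plain integer-bx-integer form, not the special values.

```python
from fractions import Fraction

def from_bx_sketch(text):
    """Parse '<integer>bx<integer>' into the exact rational mantissa * 2**exponent."""
    mant, exp = text.strip().split("bx")
    return Fraction(int(mant)) * Fraction(2) ** int(exp)

# Different spellings of the same value agree exactly.
print(from_bx_sketch("1bx5") == from_bx_sketch("32bx0") == from_bx_sketch("2bx4") == 32)  # True

# The value read back is exactly the stored double; no rounding is involved.
n_a = from_bx_sketch("8973689019680023bx26")
print(n_a == Fraction(6.02214076e23))  # True
print(float(n_a) == 6.02214076e23)     # True
```

Reading the decimal text instead would require a fresh rounding for each target type, which is where the values start to differ.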
The "bx factor" is what you're already using in IEEE754!
Constant  SI Expression    SN factor        bx factor                Difference
N_A       1/mol            6.02214076e23    8973689019680023bx26     1.29761e+07
c         m/s              299792458        5029682823036928bx-24    0
h         kg m^2/s         6.62607015e-34   7747209898635537bx-163   1.70389e-50
e         A s              1.602176634e-19  6655181362828883bx-115   1.06265e-35
k         kg m^2/(K s^2)   1.380649e-23     4698105096070268bx-128   -9.21225e-40
nuCs      1/s              9192631770       4819586525429760bx-19    0
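The Difference column can be recomputed as the printed decimal value minus the nearest double. The table itself was calculated in 80-bit arithmetic, whereas this sketch uses exact rationals, so agreement is expected only to the digits shown. The names constants and difference are illustrative.

```python
from fractions import Fraction

# Printed SI values, copied from the table above.
constants = {
    "N_A":  "6.02214076e23",
    "c":    "299792458",
    "h":    "6.62607015e-34",
    "e":    "1.602176634e-19",
    "k":    "1.380649e-23",
    "nuCs": "9192631770",
}

def difference(text):
    """Printed decimal value minus the nearest double, as an exact rational."""
    return Fraction(text) - Fraction(float(text))

for name, text in constants.items():
    # Convert to float only for display; the subtraction itself is exact.
    print(f"{name:5s} {float(difference(text)):g}")
```

The integers c and nuCs fit in a double without loss, so their difference is exactly zero. N_A is also an integer but is too large to fit in 53 bits.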
Here is the README file from the software download:
The software to do all of this - including the above demo - is here in this
distribution. The C library is in the lib directory, with some basic docs and the demo
code, which can be compiled and run with the provided script to_make. The script also
builds the library libbx.so (header file bx.h).
The C library documentation follows, with a word about Perl and Python at the end.
bx is a library for printing floating point numbers in a compact format using decimal
digits, which truly and correctly represents the numerical content, and also for reading
them back in again. Functions for converting in both directions are provided, for the
data types 'double' (IEEE-754 64-bit) and 'long double' (Intel extended precision
80-bit). The bx format supports the special values (-)0, (-)inf, and (-)nan.
Conversion to and from denormal values is supported. Signalling nan is not treated
specially. These functions do not perform floating point operations on their arguments,
except to convert double to long double in the case of double denormals.
Requires -lm.
#include <bx.h>
char *f2bxl(long double x, char *ret);
Converts x to the bx string format in the string ret, which must already be
allocated. Normal, denormal, and pseudo-normal values are properly
represented. Zeroes, infinities, and nans also have special representations.
Pseudo-infinities and pseudo-nans are treated as nan. Pseudo-zeroes are
treated as nonzero pseudo-normals, which should never come up unless you're
testing this feature.
char *f2bx(double x, char *ret);
Converts x to the bx string format in the string ret, which must already be
allocated. Normal and denormal values are properly represented. Zeroes,
infinities, and nans also have special representations. Internally denormals
are passed to f2bxdl, which is the only case where a floating-point operation
is performed on the x argument.
char *f2bxdl(long double x, long double d, char *ret);
Converts x to the bx string format in the string ret, which must already be
allocated. If d is zero, the number is represented in full precision, but
if a nonzero value of d is passed the mantissa is truncated so that it has
no more than d decimal digits. Since d is passed as a floating point number,
an exact number of bits to remove can in effect be specified also.
Internally d is converted into an integer shift width. Normal, denormal, and
pseudo-normal values are properly represented. Zeroes, infinities, and nans
also have special representations. Pseudo-infinities and pseudo-nans are
treated as nan. Pseudo-zeroes are treated as nonzero pseudo-normals, which
should never come up unless you're testing this feature.
char *f2bxd(double x, double d, char *ret);
Converts x to the bx string format in the string ret, which must already be
allocated. If d is zero, the number is represented in full precision, but
if a nonzero value of d is passed the mantissa is truncated so that it has
no more than d decimal digits. Since d is passed as a floating point number,
an exact number of bits to remove can in effect be specified also.
Internally d is converted into an integer shift width. Normal and denormal
values are properly represented. Zeroes, infinities, and nans also have
special representations. Internally denormals are passed to f2bxdl, which
is the only case where a floating-point operation is performed on the x
argument.
char *f2bxnl(long double x, int of, char *ret);
Converts x to the bx string format in the string ret, which must already be
allocated. If of is zero, the number is represented in full precision, but
if a nonzero value of of is passed the mantissa is shifted right that many
bits before representation (used internally for f2bxdl). Normal, denormal,
and pseudo-normal values are properly represented. Zeroes, infinities, and
nans also have special representations. Pseudo-infinities and pseudo-nans are
treated as nan. Pseudo-zeroes are treated as nonzero pseudo-normals, which
should never come up unless you're testing this feature.
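f2bxnl((double), 11, ret), that is, a double value passed with of = 11 (the
64-bit long double mantissa less the 53-bit double mantissa), should be
equivalent to f2bx, although the f2bx function should be used as it avoids
an unnecessary FP conversion.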
char *f2bxn(double x, int of, char *ret);
Converts x to the bx string format in the string ret, which must already be
allocated. If of is zero, the number is represented in full precision, but
if a nonzero value of of is passed the mantissa is shifted right that many
bits before representation (used internally for f2bxd). Normal and denormal
values are properly represented. Zeroes, infinities, and nans also have
special representations. Internally denormals are passed to f2bxdl, which
is the only case where a floating-point operation is performed on the x
argument.
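f2bxn((single), 29, ret), that is, a single-precision value passed with
of = 29 (the 53-bit double mantissa less the 24-bit single mantissa), should
produce a string that bx2f can read and whose result converts to a single
without loss.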
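The shift of 29 can be illustrated with a Python sketch. It assumes that shifting the mantissa right means truncating the low bits and raising the exponent to compensate. This is a reading of the description above, not the library's actual code, and the helper name truncate_bits_sketch is hypothetical.

```python
import math
import struct

def truncate_bits_sketch(x, shift):
    """Drop `shift` low mantissa bits of a finite nonzero double.

    Returns (mantissa, exponent) with value == mantissa * 2**exponent.
    """
    m, e = math.frexp(x)                 # x == m * 2**e, with 0.5 <= |m| < 1
    mant = int(abs(m) * 2**53) >> shift  # keep the top (53 - shift) bits
    sign = -1 if x < 0 else 1
    return sign * mant, e - 53 + shift

mant, exp = truncate_bits_sketch(6.02214076e23, 29)
value = math.ldexp(mant, exp)            # mant * 2**exp, exact here

# Round-trip through a 4-byte IEEE754 single: nothing is lost.
as_single = struct.unpack("<f", struct.pack("<f", value))[0]
print(f"{mant}bx{exp}")
print(as_single == value)                # True
```

With 29 bits removed the mantissa has at most 24 significant bits, which is what a single can hold.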
long double bx2fl(char *bx);
Returns the floating point value of a valid bx string (may begin with space).
Handles the special values (-)0, (-)inf, and (-)nan. If the format of the
string is not recognized, or if the values of the mantissa or exponent are
out of range of the long double data type, nan is returned and an error is
generated on stderr. Denormal values are supported, and only in the case
of denormals is any truncation performed - the mantissa may be shifted to
the right to fit the representation. No rounding is otherwise performed.
Values which came from f2bx* will always be fully accurate.
double bx2f(char *bx);
Returns the floating point value of a valid bx string (may begin with space).
Handles the special values (-)0, (-)inf, and (-)nan. If the format of the
string is not recognized, or if the values of the mantissa or exponent are
out of range of the double data type, nan is returned and an error is
generated on stderr. Denormal values are supported, and only in the case
of denormals is any truncation performed - the mantissa may be shifted to
the right to fit the representation. No rounding is otherwise performed.
Values which came from f2bx or f2bxd/n will always be fully accurate.
In general, strings generated by f2bxl will not be readable by bx2f, and
should be passed to bx2fl instead.
See the note under f2bxn about using this function for single precision.
bx.pl contains a Perl library with functions f2bx and bx2f, which are very
similar to f2bxd/f2bx ($d defaults to 0) and bx2f in the C implementation.
Normals, denormals, zeroes, infinities, and nans are handled, in double
precision only. Floating-point operations are avoided.
bx.py contains basically the same thing for Python, but without support for
inf and nan, and using some floating-point arithmetic. Normals should work
correctly. I don't know the language well enough to plug in the compiled C
version, which would be best in the long run.