The IEEE 754-2008 standard for floating-point arithmetic defines minNum(x, y) and maxNum(x, y) functions that compute the minimum and maximum of two floating-point numbers respectively. This is simple as long as one number is greater than the other, but the functions have some aggravating corner cases:
def maxNum(x, y):
    if x > y:
        return x
    if x < y:
        return y
    if x == y:
        # What is maxNum(+0, -0)?
        return bitand(x, y)
    # What about NaN?
    ...
Negative zero
The first corner case is maxNum(+0, -0); should the result be +0 or -0? The IEEE standard leaves this up to the implementation. The ARM and MIPS architectures both choose to compare as if -0 < +0 for the purposes of min and max. This can be implemented as the bitwise and of x and y when they compare equal using the normal floating-point equality, as shown above.
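As a sketch of why the bit trick works (my illustration, assuming IEEE 754 doubles and using Python's struct module to reach the raw bit patterns): +0 is the all-zeros pattern and -0 sets only the sign bit, so ANDing the two patterns clears the sign bit and yields +0, which is what max wants; a min implementation would use bitwise or instead, keeping the sign bit and producing -0.

```python
import math
import struct

def bits(x):
    # Raw 64-bit pattern of an IEEE 754 double.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(b):
    return struct.unpack("<d", struct.pack("<Q", b))[0]

pos_zero = bits(+0.0)  # 0x0000000000000000
neg_zero = bits(-0.0)  # 0x8000000000000000 (only the sign bit set)

# AND clears the sign bit: max(+0, -0) comes out as +0.
max_zero = from_bits(pos_zero & neg_zero)
# OR keeps the sign bit: min(+0, -0) comes out as -0.
min_zero = from_bits(pos_zero | neg_zero)

print(math.copysign(1.0, max_zero))  # 1.0, i.e. +0
print(math.copysign(1.0, min_zero))  # -1.0, i.e. -0
```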
The Intel SSE maxss and maxsd instructions mimic the C expression x > y ? x : y, so they would return y in this case:
>>> maxss(+0, -0)
-0
>>> maxss(-0, +0)
+0
These instructions are not commutative.
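A small Python model (my sketch, capturing only the x > y ? x : y behavior described above) makes the asymmetry easy to see: whenever the comparison is false, including when either operand is a NaN, the second operand wins.

```python
import math

def maxss(x, y):
    # Model of SSE maxss/maxsd: literally `x > y ? x : y`.
    # When the comparison is false (equal operands, or any NaN),
    # the second operand is returned.
    return x if x > y else y

# +0 and -0 compare equal, so the second operand wins either way:
print(math.copysign(1.0, maxss(+0.0, -0.0)))  # -1.0, i.e. -0
print(math.copysign(1.0, maxss(-0.0, +0.0)))  # 1.0, i.e. +0

# NaN handling is asymmetric for the same reason:
print(maxss(1.0, float("nan")))  # nan
print(maxss(float("nan"), 1.0))  # 1.0
```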
Not a number
What if one of the operands is a NaN? Since many other IEEE primitives like addition and subtraction always produce a NaN when one of the operands is a NaN, it would make sense for min and max to do the same. Indeed, ARM has an fmax instruction which does exactly that:
def fmax(x, y):
    if x > y:
        return x
    if x < y:
        return y
    if x == y:
        return bitand(x, y)
    return NaN
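Filling in bitand with actual bit manipulation gives a runnable Python model of this NaN-propagating behavior (a sketch assuming doubles; the real fmax instruction operates on hardware registers):

```python
import math
import struct

def bitand(x, y):
    # Bitwise AND of the two doubles' bit patterns.
    xb = struct.unpack("<Q", struct.pack("<d", x))[0]
    yb = struct.unpack("<Q", struct.pack("<d", y))[0]
    return struct.unpack("<d", struct.pack("<Q", xb & yb))[0]

def fmax(x, y):
    # Model of ARM fmax: any NaN operand produces a NaN result.
    if x > y:
        return x
    if x < y:
        return y
    if x == y:
        return bitand(x, y)
    return float("nan")

print(fmax(1.0, 2.0))                        # 2.0
print(math.copysign(1.0, fmax(+0.0, -0.0)))  # 1.0: -0 < +0 for max
print(fmax(1.0, float("nan")))               # nan
print(fmax(float("nan"), 1.0))               # nan
```

Unlike maxss, this function is commutative: a NaN poisons the result no matter which operand it is.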
IEEE, however, treats NaN as a missing value for the purposes of the minNum and maxNum functions. They will suppress a single NaN operand and return the number instead:
def maxNum(x, y):
    if x > y:
        return x
    if x < y:
        return y
    if x == y:
        return oneOf(x, y)
    if not isNaN(x):
        return x
    if not isNaN(y):
        return y
    return NaN
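The pseudocode translates directly into a runnable Python sketch (quiet NaNs only; Python offers no way to construct a signaling NaN, and I gloss over the ±0 case by just returning x when the operands compare equal):

```python
import math

def maxNum(x, y):
    # Model of IEEE 754-2008 maxNum: a single quiet NaN is suppressed.
    if x > y:
        return x
    if x < y:
        return y
    if x == y:
        return x  # glossing over the +0/-0 choice here
    if not math.isnan(x):
        return x
    if not math.isnan(y):
        return y
    return float("nan")

nan = float("nan")
print(maxNum(1.0, nan))  # 1.0: the NaN is suppressed
print(maxNum(nan, 1.0))  # 1.0: commutative, unlike maxss
print(maxNum(nan, nan))  # nan
```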
The rationale for this behavior is not completely clear to me. This is from Prof. Kahan’s notes:

Some familiar functions have yet to be defined for NaN. For instance max{x, y} should deliver the same result as max{y, x} but almost no implementations do that when x is NaN. There are good reasons to define max{NaN, 5} := max{5, NaN} := 5 though many would disagree.
There is further discussion of this choice in the comments on a Julia language issue.
Signaling NaNs
There are two types of NaNs: quiet and signaling NaNs. Quiet NaNs can be produced by invalid operations like 0 / 0 or ∞ - ∞, but signaling NaNs are not produced by normal arithmetic operations, although they can be propagated by certain functions like negate and copySign that only manipulate the sign bit. All other operations are invalid with a signaling NaN operand, causing them to produce a quiet NaN result. This includes the minNum and maxNum functions:
def maxNum(x, y):
    if isSignaling(x) or isSignaling(y):
        return NaN
    if x > y:
        return x
    if x < y:
        return y
    if x == y:
        return oneOf(x, y)
    if not isNaN(x):
        return x
    if not isNaN(y):
        return y
    return NaN
This means that these functions suppress quiet NaNs while signaling NaNs are converted to a quiet NaN:
>>> maxNum(1, NaN)
1
>>> maxNum(1, sNaN)
NaN
The conversion of signaling NaNs to quiet NaNs as they are propagated has the unfortunate effect of making these functions non-associative:
>>> maxNum(1, maxNum(sNaN, 2))
1
>>> maxNum(maxNum(1, sNaN), 2)
2
It is quite unfortunate that computing the maximum of a set containing the value
2 can produce the result 1. It seems that the answer should be either 2 or NaN.
This bad behavior when maxNum functions are chained appears because:
- Signaling NaNs are propagated as quiet NaNs.
- Quiet NaNs are suppressed rather than propagated.
Signaling NaNs also latch an invalid operation floating-point exception, but very few programs use the floating-point status flags. Trapping on floating-point exceptions is almost always disabled.
The ARMv8 instruction set has an fmaxnmv instruction which computes the IEEE maxNum function across the lanes of a SIMD vector:
def fmaxnmv(v):
    return maxNum(maxNum(v[0], v[1]), maxNum(v[2], v[3]))
Since the underlying function is non-associative, we can get different results simply by permuting the lanes of the SIMD vector:
>>> fmaxnmv([sNaN, 1, 2, 3])
3
>>> fmaxnmv([1, 2, 3, sNaN])
2
This sets the invalid operation bit in the FPSR floating-point status register as the only indication that something might be wrong.
Mitigation
So how can we compute the min or max of a set of floating-point numbers without losing our sanity? There’s a couple of options:
- Use min and max functions that propagate all NaN operands, like the ARM
fmax
instruction described above. If any member of the set is a NaN (quiet or signaling), the result will be NaN. - Use min and max functions that suppress all NaN operands, quiet or signaling.
This is valid to do in a C implementation since the C11 specification leaves
the handling of signaling NaNs implementation-defined. On OS X, this if how
the
<math.h>
fmin()
andfmax()
functions behave. - Detect invalid operation floating point exceptions. Signaling NaN operands cause operations to latch an invalid operation exception and return NaN. If such an exception was latched while computing the maximum element of a set, the result can’t be trusted.