Simple Fixed-Point Math

I recently needed to implement a simple fixed-point math library, and found that there were few good online resources on how to implement one, and that many of the simple free implementations had subtle errors. I decided to distil a few details I would have liked to have when I started.

The Basic Idea

The basic idea is to use regular integer types to represent fractional values.

Fixed-point math is most commonly used on systems that lack an FPU, or when you need a little more performance or precision than the standard floating-point types can provide (hint: this is rare).
Fixed-point values are much less convenient to work with than floating-point values, so you should only use them as a last resort. Even on a platform without an FPU, the compiler-provided software implementation of floating point is often fast enough.

Representation

There are several ways to represent a fixed-point value. The best representation will depend quite heavily on your particular use case. So do your research.

The simplest way (as far as I know) is to pick a scaling factor (powers of two are the most convenient) and then store the values you want to represent, scaled by that factor. So to convert a floating point number to this fixed-point representation you just multiply the number by your scaling factor and round to an integer. The resulting value is your fixed-point representation of the given value:

fixed_point_value = round(floating_point_value * scaling_factor)

To convert from your integer fixed-point representation back to floating point you cast your fixed-point value to a float, and then divide by your scaling factor:

floating_point_value = ((float)fixed_point_value)/scaling_factor

For this post, let's use a signed 32-bit integer base type with 2^16 (65536) as the scaling factor. This gives us 16 integer bits (minus the sign bit) and 16 fractional bits.

So in this representation:

  • 1.0 would be represented by 1.0*2^16 or 65536,
  • -0.25 would be represented by -0.25*2^16 or -16384,
  • π would be represented by 3.14159*2^16 or 205887,
  • etc.

(I’ll assume this format for the rest of the post)

Basic Operations

One of the advantages of the particular representation I picked is that most of the basic math operations are trivial. Comparison, addition, subtraction, and negation require no modification; they work just like they do on regular integers. Even bit shifting still multiplies and divides by powers of two, just as with regular integers. Also, shockingly (though obvious in hindsight), modulus works unmodified as well.

Multiplication & Division

Multiplication and division are a bit trickier.

Multiplication

Remember that your fixed-point representation has the scaling factor built into it. So when you are multiplying or dividing your fixed-point representation you are also multiplying or dividing these scaling factors as well. This ends up being a problem. Say you had a number x and a scaling factor s. In your representation you are storing s*x. When we multiply two numbers together we have: s*x * s*y or s*s*x*y. But what we really want is s*x*y! We need to divide out one of those extra scaling factors.

The wrong way:

  int32_t fixed_mul_16(int32_t x, int32_t y)
  {
    return (x * y) / (1 << 16);
  }

On the surface, this looks ok (and I’ve seen a few fixed-point math libraries that do this), but you lose half of your bits this way. The intermediate x*y can overflow even when the result of the multiplication is still representable in our fixed-point format.

The right way:

  int32_t fixed_mul_16(int32_t x, int32_t y)
  {
    return ((int64_t)x * (int64_t)y) / (1 << 16);
  }

Casting to 64 bit to do the multiply and then dividing out will ensure that the intermediate value doesn’t overflow when the result is still representable.

“But wait!” you say. “Can’t you replace that division with a bit shift and do something like this?”

  int32_t fixed_mul_16(int32_t x, int32_t y)
  {
    return ((int64_t)x * (int64_t)y) >> 16;
  }

Yes, but there is a catch.

In C, bit shifts of negative numbers are problematic. Specifically:

  • Left shift of a negative number is an undefined operation.
  • Right shift of a negative number is an implementation defined operation.
  • Basically, don’t bit shift negative numbers in C.
  • You can still use bit shifts, you just need to handle the negative number case specially.
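Handling the negative case specially might look like this (a sketch; one of several approaches, another being to just keep the division and let the compiler optimize it). Shifting the magnitude and restoring the sign truncates toward zero, matching the behavior of the division-based version above:

```c
#include <stdint.h>

int32_t fixed_mul_16(int32_t x, int32_t y)
{
    int64_t product = (int64_t)x * (int64_t)y;
    /* Never right-shift a negative value: shift the magnitude,
       then put the sign back. This truncates toward zero, just
       like the division-based version. */
    if (product < 0)
        return (int32_t)(-((-product) >> 16));
    return (int32_t)(product >> 16);
}
```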

Division

Division is like multiplication, except backwards. If we want to compute x / y we’d have (s * x) / (s * y), which gives x / y. But we really want s * (x / y). We need to multiply in a scaling factor.

The Wrong Way

  int32_t fixed_div_16(int32_t x, int32_t y)
  {
    return (x / y) * (1 << 16);
  }

Again, you lose half of your precision: the integer division x / y truncates away all of the fractional bits before the scaling factor is multiplied back in.

The Right Way

  int32_t fixed_div_16(int32_t x, int32_t y)
  {
    return ((int64_t)x * (1 << 16)) / y;
  }

And that’s it! Now you can implement all basic operations for a fixed-point representation!

A Few Gotchas

  • I highly recommend wrapping your fixed-point types in a structure so you don’t accidentally use native operations on them.
  • Watch out for the type promotion rules in C/C++.  int64_t x = 1 << 34; does not give you 2^34; the shift happens at int width, which is undefined behavior. You want int64_t x = (int64_t)1 << 34; instead.
  • Multiplication and division can rapidly use up available precision, large (or very small) intermediate results in a computation can easily cause overflows. This is why floating point is much better if you can use it instead.
  • Division by zero is not the only integer division case that can cause an exception. INT32_MIN / -1 (-2147483648 / -1) overflows and will also cause an exception on many machines. This is not very common with regular integers, but is much more likely with a fixed-point representation.
  • This implementation doesn’t round the result. It just truncates. This causes a small loss of precision in the multiplication and division operations.
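As an illustration of the division gotchas, a defensive wrapper might look like this (a sketch; saturating on the bad cases is just one possible policy, and the name fixed_div_16_safe is mine):

```c
#include <stdint.h>

int32_t fixed_div_16_safe(int32_t x, int32_t y)
{
    if (y == 0)
        return x >= 0 ? INT32_MAX : INT32_MIN; /* saturate on divide by zero */
    if (x == INT32_MIN && y == -1)
        return INT32_MAX;                      /* INT32_MIN / -1 would overflow */
    return (int32_t)(((int64_t)x * (1 << 16)) / y);
}
```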

That’s it for today. Perhaps I’ll make a follow-up post that covers rounding, and how to implement advanced math functions like the trig functions (I found Taylor series and generalized continued fractions to work quite well).