I recently needed to implement a simple fixed-point math library, and found that there were few good online resources on how to implement one and that many of the simple free implementations had subtle errors. I decided to distil out a few details I would have liked to have had when I started.

## The Basic Idea

The basic idea is to use regular integer types to represent fractional values.

Fixed-point math is most commonly used for systems that lack an FPU or when you need a few more ounces of performance or precision than the standard floating point types can provide (hint: this is rare).

Fixed-point values are *much* less convenient to work with than floating point values. You should only use them as a last resort. Even if you’re on a platform without an FPU the compiler provided software implementation of floating point numbers is often fast enough.

## Representation

There are several ways to represent a fixed-point value. The best representation will depend quite heavily on your particular use case. So do your research.

The simplest way (as far as I know) is to pick a scaling factor (powers of two are the most convenient) and then store the values you want to represent, scaled by that factor. So to convert a floating point number to this fixed-point representation you just multiply the number by your scaling factor and round to an integer. The resulting value is your fixed-point representation of the given value:

`fixed_point_value = round(floating_point_value * scaling_factor)`

To convert from your integer fixed-point representation back to floating point you cast your fixed-point value to a float, and then divide by your scaling factor:

`floating_point_value = ((float)fixed_point_value)/scaling_factor`

For this post lets use a signed 32bit interger base type with 2^16 (65536) as the scaling factor. This gives us 16 integer bits (minus the sign bit) and 16 fractional bits.

So in this representation:

- 1.0 would be represented by 1.0*2^16 or 65536,
- -0.25 would be represented by -0.25*2^16 or -16384
- π would be represented by 3.14159*2^16 or 205887
- etc..

(I’ll assume this format for the rest of the post)

## Basic Operations

One of the advantages of the particular representation I picked is that most of the basic math operations are trivial. The **comparison operations**, **addition**, **subtraction**, **negation**, **bit shifting**, require no modifications. They work just like they do on regular integers. even bitshifting still multiplies and divides by 2 just like with regular integers. Also, *shockingly* (though obvious in hindsight), **modulus** also works unmodified.

## Multiplication & Division

Multiplication and division are a bit trickier.

### Multiplication

Remember that your fixed-point representation has the scaling factor built into it. So when you are multiplying or dividing your fixed-point representation you are also multiplying or dividing these scaling factors as well. This ends up being a problem. Say you had a number *x* and a scaling factor *s*. In your representation you are storing *s*x*. When we multiply two numbers together we have: *s*x * s*y* or *s*s*x*y*. But what we really want is *s*x*y*! We need to divide out one of those extra scaling factors.

The wrong way:

int32_t fixed_mul_16(int32_t x, int32_t y) { return (x * y) / (1 << 16); } |

On the surface, this looks ok (and I’ve seen a few fixed-point math libraries that do this), but you lose half of your precision this way. The *x*y* can overflow even if the result of the multiplication is still representable in our fixed-point format.

The right way:

int32_t fixed_mul_16(int32_t x, int32_t y) { return ((int64_t)x * (int64_t)y) / (1 << 16); } |

Casting to 64 bit to do the multiply and then dividing out will ensure that the intermediate value doesn’t overflow when the result is still representable.

“But wait!” you say. “Can’t you replace that division with a bit shift and do something like this?”

int32_t fixed_mul_16(int32_t x, int32_t y) { return ((int64_t)x * (int64_t)y) >> 16; } |

Yes, but there is a catch.

In C, bit shifts of negative numbers are problematic. Specifically:

- Left shift of a negative number is an undefined operation.
- Right shift of a negative number is an implementation defined operation.
- Basically, don’t bit shift negative numbers in C.
- You
*can*still use bit shifts, you just need to handle the negative number case specially.

### Division

Division is like multiplication, except backwards. If we want to compute *x / y* we’d have* (s * x) / (s * y)*, which gives *x / y*. But we really want *s * (x / y)*. We need to multiply in a scaling factor.

**The Wrong Way**

int32_t fixed_div_16(int32_t x, int32_t y) { return (x / y) * (1 << 16); } |

Again, you lose half of your precision.

**The Right Way**

int32_t fixed_div_16(int32_t x, int32_t y) { return ((int64_t)x * (1 << 16)) / y; } |

And that’s it! Now you can implement all basic operations for a fixed-point representation!

## A Few Gotchas

- I highly recommend wrapping your fixed-point types in a structure so you don’t accidentally use native operations on them.
- Watch out for the type promotion rules in C/C++.
`int64_t x = 1 << 34;`

gives`x == 0`

- Multiplication and division can rapidly use up available precision, large (or very small) intermediate results in a computation can easily cause overflows. This is why floating point is much better if you can use it instead.
- Division by zero is not the only integer division case that can cause an exception.
`INT32_MIN/-1`

(`-2147483648/-1`

) (for 32bit machines) will also cause an exception. This is not very common for regular integers but is much more likely with a fixed-point representation. - This implementation doesn’t round the result. It just truncates. This causes a small loss of precision in the multiplication and division operations.

That’s it for today. Perhaps I’ll make a follow up post that covers rounding, and how to implement advanced math functions like the trig functions (I found taylor series and generalized continued fractions to work quite well).

This is a very well written intro to fixed point math, Job. I’d love to see you continue to more advanced topics. There are so few good resources on doing fixed point properly.

A couple examples: The casts to int64_t’s work, but they’re a time-consuming hack that won’t always be necessary. Second, you’ve talked about a convenient case where all your fixed point numbers have the same radix point… but most real world applications have variables at different radix points. The concepts are the same, of course, but there are some new gotchas introduced.

For anyone who wants to become a fixed-point master, I’d recommend learning about Interval Arithmetic. You’ll be glad you did.

[…] you’re using a datatype that is more restrictive than a float (like a fixed point type), I would recommend trying to get your basic algorithms working in a scripting language using […]

I came to look at this page to remind myself of certain fixed point math notions. I was confused by the code segments because I did not recognize the operator being used. This is apparently because intervening layers of the internet have changed the left and right shift operators in the code examples to the html safe versions of these operators. The penny dropped when I read this:http://stackoverflow.com/questions/891097/what-does-the-lt-mean-in-c-not-modulus-i-know-what-that-does