Fixed Point Arithmetic

The Problem

Say we want to scroll the background left a little bit every frame. Our program loop is operating at ~60hz so adding 1 pixel at a time would scroll it 60 pixels/second, that might be too fast! The solution is to add a fraction of a pixel every frame, like 0.25 instead of 1.

Now, the obvious easy method to do this with PC programming is to use floating point numbers. Floating point arithmetic may take advantage of an FPU (floating point unit) to calculate special numbers with a fractional part.

The problem with floating point on the DS is that the DS does not have an FPU. Floating point operations are emulated by the CPU. This makes them very very slow, and it eats up your processing time.

The Solution

Here is a great alternative to floating point numbers: Fixed point numbers!! :D

Fixed point numbers are in a special numbering format where the fractional part and the whole 'integer' part are separated. An example... hmm, look at your clock (I hope it's digital).

When the minute reaches 60 then the hour will increment, this is like a fixed point format, with the fractional range being 0->60.

Now, to represent the clock value (hours and minutes only) in a single value, the value can be 13*60 + 37. Anything we add to this value will be in the scale of 'minutes'.

But, now we have a problem, the clock value is mixed together... To retreive the value, we can do:

hours = INTEGER(value / 60)
minutes = value % 60

This causes a big problem when doing stuff on handheld consoles that aren't very powerful (it's also a problem when programming for any computer). The above formulas require a divide operation to calculate the value, divides are usually quite slow, no matter what computer you are programming for.

To fix this problem, we can be smart and use a power of 2 as the fractional range (you can stop looking at the clock now). If the fractional range is a power of 2, then we can use bit-shifts/masking instead of division to compute the two values!

For example, I have this number here: 1502. I wonder what real value it represents?? Well, it depends on what format the number is in. If we are using fixed point with a fractional range of 28, then the value will be:

whole = INTEGER(1502 / 28) = 1502 >> 8 = 5
fraction = 1502 % 28 = 1502 & 255 = 222
VALUE = 1502 / 256 = 5.8671875 (!)

These calculations can be done very fast by computers. Bit shifts can be done in a single instruction (1 clock cycle, you get over 60 million cycles/second).

From here onward, when I refer to fixed point numbers, I am only referring to those which have fractional ranges in powers of 2.

Different Formats

There's probably many ways people represent the integer and fractional size for a fixed point number. The one I see often is the I.F format, where I is the number of bits that represent the integer portion, F is the number of bits that represent the fractional portion, and I+F is the size of the container. If a number has 5 bits representing integer part, and 5 bits representing fractional part, then it would be called "5.5". Sometimes a little prefix is added too, like "f5.5" or "fx5.5". There isn't really a standard notation.

A commonly used format is 24.8, where the fraction can be incremented 256 times before the integer part is incremented. The first byte (8 bits) of data hold the fractional part. The other 3 bytes (24 bits) contain the integer value.

The 8 bits of fraction is usually enough data for most operations. People sometimes increase the number of bits for the fractional part when accuracy is an issue. (And they also have to be cautious when decreasing the integer size).

In a perfect world, we will probably prefer the fixed point format with an infinite integer size (and a super super huge fractional size). :)

Converting Between Fixed Point and Decimal

This is pretty easy to understand. To convert a decimal number to a fixed point value, multiply the decimal number by 2f where f is the number of bits that represent the fractional part in the fixed point value. After converting you must throw away the fraction; round to the nearest integer. So that's x = ROUND(2df).

To convert the opposite way, simply divide instead of multiplying, or negate the exponent. d = x2-f

Accuracy

Now, after converting to fixed point, the number must discard the value after the decimal point. This introduces a little bit of an innacuracy problem. Most of the time an 8 bit fraction gives good enough accuracy, but sometimes more is needed.

For an example, let's convert 1.11 to fixed point using 8 bits fraction.

284.16 = 1.11 × 28

We have to throw away the part after the decimal point so it's an integer value; we round to 284.

This is the actual value we get, BUT, it is slightly off from its source. When we convert back to decimal, we get:

1.109375 = 284 × 2-8

The error is 0.000625. That's a pretty small error, but it may [will] add up if you accumulate the results. Keep accuracy in mind when you program some advanced fixed point stuff!

Addition and Subtraction

Let's start with the basics. You can add and subtract fixed point numbers just like any regular integer. BUT, if the two numbers are in different formats you must convert one format to the other before adding/subtracting.

If you go back to fourth grade, you might remember learning how to align the decimal points before you add something. We can use this concept here.

In this example, we are adding a 24.8 fixed point value to a 28.4 one. We must align the fractional parts before adding the two numbers (like aligning the decimal points). We shift the second value left 4 spaces to match the fraction of the other addend. Since we are using 32-bit operations, shifting the 28.4 value overflows past the 32-bit work area. If those top 4 bits of the number have data in them, the data will be lost and there will be an error in the calculation. This limits the integer size of the second number to only 24 bits. The result is a 24.8 fixed point value.

To avoid the overflowing problem, and have less accuracy instead, we can shift the first value right 4 bits to align, instead of the second value shifted left 4 bits. This would affect the fractional size of the result too, giving us a 28.4 number.

You can also shift both of the values to match the fractional parts, you need to decide which method is neccesary for the operation in question.

Subtraction is exactly the same as addition, except replace PLUS with MINUS. :P

Multiplication

Now it gets a little more complicated. When we multiply two fixed point numbers the fractional and integer sizes get added to each other, we must ensure that the product does not overflow past the 32-bit working area.

In this example, we multiply a 24.8 number with a 28.4 number. The result we get is a 52.12 number, but, since the number exceeds the 32-bit region, the top 32 bits cannot be used; the integer part must be truncated to 20 bits. This limits the value of the result significantly. If there is data in those top 32 bits, then it will be lost, and the result will be invalid. There isn't usually a need to multiply two huge numbers by each other.

After the multiplication, the example shifts the truncated result right 4 bits to restore 24.8 format.

Now, the shifting can be done before or after multiplication, or maybe before AND after, or even... not at all. It all depends on how much accuracy you need, how big the numbers will be (usually the 24 bit integer isn't fully used), and/or what output format you want. If you shift the value before multiplication, you will lose data in the fractional part, causing some error in the result. If you shift afterwards, the product will be somewhat limited due to the 32-bit work boundaries. If you shift before and after, then you'll get a fraction of both bad stuff.

Sometimes you don't want to shift at all, or you want to shift to another amount. Say you need a *.8 fixed point number from 2 factors. You have a *.0 whole number, and another *.8 fixed point number. You multiply the *.0 by *.8; you end up with a *.8 in the result. The result doesn't need any shifting because this is already the desired format.

If you really need the accuracy and big numbers multiplied, you can use 64-bit operations to double the work boundaries. 64-bit operations may be a little bit slower, but sometimes they are needed.

Division

Division is a bit interesting. After you divide two fixed point numbers, the fractional size of the dividend gets subtracted by the divisor's fractional size.

Now, when you divide 2 numbers with the same format. You will lose all of the fractional part (top-left example). This usually isn't desired! This inaccuracy is prevented by shifting the dividend left by some amount (usually equal to the divisors fractional size). This is shown in the bottom example.

Another interesting effect is when you divide by a number with a greater fractional size. The result will be the quotient divided by some amount. The value needs to be shifted left after the division to get the real integer number.

Heh

Anyway, you'll get the hang of it some day. I hope you got a little grasp of it, we'll be applying a bunch of it in the next chapter.

Previous: SpritesContentsNext: The Bouncy Ball