Sunday, June 25, 2006

Effective optimizations for graphic programmers


Warning: This post contains a bit of technical material, along with some source code. Read on if you're interested.

I don't claim to be an expert, or even intermediate, in the vast field of graphics programming, but here are a few cool optimizations/replacements that I have been using for common math functions. This list is by no means complete, and most of these optimizations are not my inventions. If anyone wishes to suggest more, please feel free to do so.

Graphics programming, 2D as well as 3D, involves extensive use of math, especially vectors and matrices, and therefore leans heavily on a handful of math functions: trigonometric ratios, squares, square roots and so on. Vector math in particular is full of the "inverse square root", i.e. 1/sqrt(x), one divided by the square root. This invSqrt shows up in everything from normalizing vectors and finding magnitudes to finding angles between vectors, projections, etc. The main point of this paragraph: graphics code is dominated by math.

Now, before we can make the next Doom 3 engine, we first need to get the engine up to the mark. It's no good having a fantastic engine intended for games, only to find it running like a slide-show even on high-end systems. John Carmack is famous partly because he wrote all of id Software's game engines (Wolf3D, the Doom and Quake series, the Doom 3 engine), and reading his code reveals some amazing tricks and optimizations. All the code up to and including Quake 3 has been released as open source under the GPL, so you can download it and take a peek if you're interested. I recommend that. It turns out that most math functions can be hacked quite a bit to gain a great deal of speed.

A) Inverse Square Root
First I look at the famous "inverse square root" function that was first widely seen in Quake 3. It is based on the Newton-Raphson method for successively approximating the root of a function; the quantity being approximated here is 1/sqrt(x). To read in detail about how this works, you can see this and this (more technical). The code:

float InvSqrt(float x)
{
    float xhalf = 0.5f * x;
    int i = *(int*)&x;               // reinterpret the float's bits as an int
    i = 0x5f375a86 - (i >> 1);       // magic constant gives initial guess y0
    x = *(float*)&i;                 // convert the bits back to a float
    x = x * (1.5f - xhalf * x * x);  // one Newton step; repeating it increases accuracy
    return x;
}
Test Result: I had tested this before using it in my code. Calculating the inverse square roots of 500,000 numbers, the standard 1/sqrt(x) took 132 milliseconds on average, while the above code took 18 milliseconds on average.

B) Square Root:
Then, digging around Quake's source, I found an optimization for the plain square root, which is essentially the same trick with a small modification: since sqrt(x) = x * (1/sqrt(x)), the code computes the inverse square root (with two Newton steps this time, for extra accuracy) and multiplies by the input. The code:

float fSqrt(float number)
{
    long i;
    float x, y;
    const float f = 1.5F;

    x = number * 0.5F;
    y = number;
    i = *(long*)&y;             // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);  // magic initial guess for 1/sqrt(number)
    y = *(float*)&i;            // convert the bits back to a float
    y = y * (f - (x * y * y));  // first Newton step
    y = y * (f - (x * y * y));  // second Newton step
    return number * y;          // number * 1/sqrt(number) == sqrt(number)
}
Test Result: To calculate the square roots of 500,000 numbers, the standard compiler function took 110 milliseconds on average, while the above code took 26 milliseconds on average.

C) Square, Cube etc.
This one I found on my own. It is best used when you know the exponent (the power to raise to) beforehand. There are two ways of raising a number to a specific power. One is the compiler's pow(x, n) function, which raises x to the power n. I've often seen people reach for pow() when n is large, say above 5 or 6, even when they know the exponent while coding. The other way is simply to multiply the number by itself as many times as required: to raise x to the 6th power, write [x = x*x*x*x*x*x]. It might seem odd, since this method looks so ugly (imagine doing that for the 15th power), but believe it or not, the speed difference is amazing. Use the second method whenever the exponent is fixed, even for the 15th power.
Test Result: I performed two separate tests, each on 500,000 numbers as usual. Simple squaring with the pow() compiler function took 45 milliseconds on average; the ugly x*x took 2 milliseconds. I then raised 500,000 numbers to the 9th power: pow() took 67 milliseconds on average, while [x*x*x*x*x*x*x*x*x] took just 3 milliseconds. The uglier way is far more efficient.

D) Trigonometric Ratios
This is quite a common trick, so I won't put up a result sheet for it. The idea is to build a 'lookup table' beforehand for all the trigonometric functions you use. The compiler's trig functions are VERY expensive (they take time), and it's bad to call them while rendering. Suppose our program uses cosine: we make a table (array) of all the angles the program might use and a table (array) of the cosines of those angles. The speed boost is immediate, since picking an element of an array (random access) is a very fast operation.

Generally we have an idea of the range of angles we might use and the smallest required interval between angles. For example, we might need only whole-number angles, not fractional ones, so we compute, say, just 91 cosines (0 to 90 degrees). Or if the smallest step between angles is 0.5 degrees, we compute the cosines of 0.0, 0.5, 1.0, 1.5 and so on while the program is starting up. After that we just look up the table and get the cosine very quickly. In case the required angle is not *exactly* present in the table, say we need 13.7 degrees and the table has 13.5, it makes sense to pick the cosine of 13.5 and be happy with it: in real-time graphics it's smart to make small compromises here and there to gain a great deal of speed.
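The scheme above can be sketched in a few lines. The table size, the 0.5-degree step, and the function names are all illustrative assumptions; a real engine would pick the range and resolution to match its needs (and clamp or wrap out-of-range angles).

```cpp
#include <cmath>

// Cosine lookup table at 0.5-degree steps over 0..90 degrees.
// 181 entries: 0.0, 0.5, 1.0, ..., 90.0 degrees.
const int COS_STEPS = 181;
float cosTable[COS_STEPS];

// Fill the table once at startup; after this, no trig calls are needed.
void InitCosTable()
{
    for (int i = 0; i < COS_STEPS; ++i) {
        float deg = i * 0.5f;
        cosTable[i] = cosf(deg * 3.14159265f / 180.0f);
    }
}

// Look up the cosine of an angle in degrees (valid for 0..90),
// rounding to the nearest 0.5-degree entry.
float FastCos(float deg)
{
    int idx = (int)(deg * 2.0f + 0.5f);  // nearest table slot
    return cosTable[idx];
}
```

So FastCos(13.7f) quietly returns the cosine of 13.5 degrees, exactly the small compromise described above.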

PS: I think I need to change the blog theme. I tend to write long entries and this theme squeezes the text into a narrow column making it look dauntingly long.

3 comments:

Ayush Gupta said...

How did you calculate the time required to perform each task? Is there a facility in C++ (of which I am as yet unaware) that lets the coder record the time taken to perform a particular function?

Anonymous said...

Good one, but the standard version takes time for a reason: minimizing error and taking care of edge conditions.
What would 4^(-6) yield in your code? Or better, how would you write code that handles it? Try it on your calculator and then in your program, if possible for all possible values.

Shashank Shekhar said...

These functions are not replacements for the stdlib. I forgot to mention that, but I did clearly say that these are suited to real-time game and graphics programming. And I still mean it, because when we deal with graphics in games and the like, we don't mind the *REALLY* small error in the answers. I wouldn't have put this up if I hadn't measured the maximum relative error, which turned out to be 0.00175228. That hardly makes a difference to a gamer or to a rendering engine that does not need ray-tracing-like precision; even the faster ray-tracing algorithms use simple hacks like these. These aren't meant for day-to-day use. For a simple program making a few thousand calls, it makes sense to use the standard compiler function, since a standard C++ function is written to work best with the compiler; but in graphics, little speed tricks like these work acceptably at much greater speed.

As for your question about 4^(-6): since you already know the power to raise to, why not do [0.25*0.25*0.25*0.25*0.25*0.25]?

Forget 4, just use x. Suppose you want to do x^(-6), use [1.0 / (x*x*x*x*x*x)].

This method still wins: it takes 27 milliseconds on average for 500,000 different numbers, versus 88 milliseconds on average for the compiler's pow() function. Still about three times faster. It takes more than the 3-4 milliseconds of the case where I did not have to divide, but this time we're dividing, and division is generally a more expensive operation.

In the end, it's up to the developer to be careful about where to use these tricks and where not to. Using them everywhere might end up being costlier in other ways. Except for my powers method, all the other methods are used in almost every modern graphics engine. I haven't checked whether people use the powers method, but it's such a simple thing that it hardly matters. It is much faster, though.

Ayush, I haven't really looked into standard C++ functions for returning milliseconds since a particular date (generally 1st Jan 1970), though I know Java has one. In C++ I was programming with SDL & OpenGL for graphics, and SDL has a function called SDL_GetTicks() which returns the number of milliseconds since a particular moment (it doesn't matter since when). We dump that value into a variable before calling a function, call SDL_GetTicks() again *after* the test function has run, and subtract the two values. This is generally how frames-per-second values are calculated too.

PseudoCode:
timeTaken = SDL_GetTicks();

callWhateverFunctionYouWantToTest();

timeTaken = SDL_GetTicks() - timeTaken;
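The same pattern works without SDL using the standard C clock() function. This is only a rough sketch: clock() measures processor time rather than wall time and its resolution varies by platform, so for serious benchmarking SDL_GetTicks() or a platform timer is the better choice. The TimeInMs name is mine.

```cpp
#include <ctime>

// Hypothetical stand-in for the SDL_GetTicks() pattern above:
// time a function call in milliseconds using the standard clock().
long TimeInMs(void (*fn)(void))
{
    clock_t start = clock();            // timestamp before the call
    fn();                               // the function under test
    clock_t end = clock();              // timestamp after the call
    return (long)((end - start) * 1000 / CLOCKS_PER_SEC);
}
```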