PETROCKET: 2009

Saturday, September 12, 2009

How Much Memory Does This Planet Need?

These are the numbers I'm using for the planet in the video and below are estimates of how much uncompressed memory would be needed for a planet with these properties:

Patch Size (# vertices on a single patch edge): 33
Max Depth of Quad Tree: 5
Texture Size Per Patch: 128 x 128

Assuming I could use a single height map, normal map and tangent map and sample them by using geomipmapping:
16 bit height map memory = 6 faces x ((33 verts per patch x 2^(max depth + 1)) - (2^(max depth + 1) - 1))^2 x 2 (because 16 bit)
= 6 x ((33 x 64) - (64 - 1))^2 x 2
= 6 x (2049 x 2049) x 2
= 6 x 8,396,802
= 50,380,812 bytes

8 bit normal map memory = 6 faces x ((33 verts per patch x 2^(max depth + 1)) - (2^(max depth + 1) - 1))^2 x 3 (for x,y,z)
= 6 x (2049 x 2049) x 3
= 75,571,218 bytes

tangent map is the same size as normal map

128x128 textures memory: this is a summation over every level, or you can just calculate the amount needed for the deepest level and add 1/3rd of it for a ceiling estimate (the same estimate used for mip mapping) because each level up the tree has a quarter of the patches of the level below it, just like a mip chain.
= 6 faces x texture size x num patches at lowest level x num bytes per pixel
= 6 faces x (128 x 128) x (2^(max depth + 1))^2 x 3
= 6 x (128 x 128) x 4096 x 3
= 1,207,959,552 bytes
ceiling = 1,207,959,552 + (1,207,959,552 / 3) = 1,610,612,736 bytes

So the total is:
50,380,812 + 75,571,218 + 75,571,218 + 1,610,612,736
= 1,812,135,984 bytes
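
For anyone who wants to plug in their own patch size or tree depth, here is a minimal sketch of the same arithmetic as C++ constexpr functions. The function and constant names are mine (hypothetical), and it assumes exactly the byte counts used above: 2 bytes per height sample and 3 bytes per normal, tangent and texture texel.

```cpp
#include <cstdint>

// Planet parameters from the post.
constexpr std::uint64_t kFaces = 6;
constexpr std::uint64_t kPatchVerts = 33;  // vertices along one patch edge
constexpr std::uint64_t kMaxDepth = 5;     // max depth of the quad tree
constexpr std::uint64_t kTexSize = 128;    // texels along one patch texture edge

// Patches along one cube-face edge at the deepest level: 2^(maxDepth + 1).
constexpr std::uint64_t patchesPerEdge() { return 1ull << (kMaxDepth + 1); }

// Vertices along one cube-face edge; neighbouring patches share an edge vertex.
constexpr std::uint64_t vertsPerEdge() {
    return kPatchVerts * patchesPerEdge() - (patchesPerEdge() - 1);
}

// Bytes for one per-vertex map (height, normal or tangent) over all six faces.
constexpr std::uint64_t mapBytes(std::uint64_t bytesPerVertex) {
    return kFaces * vertsPerEdge() * vertsPerEdge() * bytesPerVertex;
}

// Bytes for the per-patch textures at the deepest level only.
constexpr std::uint64_t deepestTextureBytes(std::uint64_t bytesPerTexel) {
    return kFaces * kTexSize * kTexSize * patchesPerEdge() * patchesPerEdge() * bytesPerTexel;
}

// Ceiling for all levels: the deepest level plus one third of it.
constexpr std::uint64_t allLevelsTextureBytes(std::uint64_t bytesPerTexel) {
    return deepestTextureBytes(bytesPerTexel) + deepestTextureBytes(bytesPerTexel) / 3;
}

// Height map + normal map + tangent map + textures.
constexpr std::uint64_t totalBytes() {
    return mapBytes(2) + mapBytes(3) + mapBytes(3) + allLevelsTextureBytes(3);
}
```

Because everything is constexpr, the totals can be checked at compile time with static_assert if you change a parameter.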

So just under 2 GB for a single planet and nothing else. While 2GB is not an unreasonable amount of RAM to ask a gaming rig to set aside these days it is still too much if you want to have a lot of other models, animations and textures loaded. That's why I've been working on resource management more than anything else.

I would like to allow the entire planet to be custom made and loaded from disk rather than just generated on the fly, and I'd like to be able to support a combination of both. The idea is that a planet starts out randomly generated and then gets customized by the artist - this should work well in cases where you have a couple of continents and the rest is ocean floor, or where you have a moon that is not customized at all. No reason for your artists to spend time on the ocean floor usually, and hopefully they can tweak the generation parameters to get close to the type of land formations they want.

Really, I could load the height map into memory and always generate the normals and tangents based on the height map. The textures would be too large to keep all in memory at once so I would still have to load/unload those as needed.

Alternately, I could scrap the whole texture per patch thing and go back to some blended texture pixel shader based on a blend map that is the same resolution as the height maps ( 6 x (2049 x 2049) x 8 layers = 201,523,248 bytes)

Saturday, September 5, 2009

MyFirstPlanet Video

Here's the first video showing the planet with textures and atmosphere shader. I used fraps (windows only) and recorded it while running in OpenGL mode. DirectX would probably be smoother but I haven't ported the shaders to HLSL yet.

My First Planet - Work In Progress - Ogre3d Engine from petrocket on Vimeo.

Now I'm still working on a simple way to pull terrain patch textures and normal maps from the disk when they're available instead of creating them on the fly - this way I hope to be able to modify the diffuse textures and customize the planet (so easy to say, so hard for me to do).

My first idea is to create a resource pool for the terrain patch diffuse images and normal map images and then load files from disk into those resource pools as needed from the build thread. Not sure if this will work well enough because it may not load enough images in advance to keep ahead of demand.

I should probably create a second thread that is a low priority thread with an entirely different quad tree based on the camera position in the future (assuming the user keeps moving in the same direction). I could update this quad tree and load images for this future position when the real camera position based quad tree isn't updating/building (and when would that be???)
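
The "camera position in the future" part could start out as a plain linear extrapolation. This is only a sketch under the assumption stated above (the user keeps moving in the same direction at the same speed); Vec3, predictCameraPosition and the look-ahead parameter are hypothetical names, not anything from Ogre or my engine.

```cpp
// Hypothetical minimal vector type for the sketch.
struct Vec3 {
    double x, y, z;
};

// Extrapolate where the camera will be after lookAheadSeconds,
// assuming it keeps its current velocity (units per second).
Vec3 predictCameraPosition(const Vec3& position, const Vec3& velocity, double lookAheadSeconds) {
    return Vec3{position.x + velocity.x * lookAheadSeconds,
                position.y + velocity.y * lookAheadSeconds,
                position.z + velocity.z * lookAheadSeconds};
}
```

The predicted position would then drive the second quad tree in the same way the real camera position drives the first one.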

Tuesday, August 25, 2009

Water Shader Update & Atmosphere Added

I reintegrated the atmosphere shell/shader and updated the water shell with a heightmap of the terrain under it so I can have the water depth affect the transparency and color and I think it adds a lot - although now everything looks tropical - which I don't mind.

Water Depth Effects:

Also I gave my underwater texture fake (static) caustics, which is subtle but nice:

Fake Caustics texture:

From Above:

And here are some pictures of the atmosphere and water shader together from various spots on the planet.

You can view all the My First Planet Images in the gallery section on my website

Notes:
- I pretty much eliminated banding in the generated terrain texture images by increasing the resolution of my terrain height maps from 8bit to 16bit.
- The atmosphere still has problems and needs tweaking - there is no sun and the sky to space transition is too abrupt amongst other things.
- PSSM VSM shadows slowed the frame rate down to around 30-40 fps so I left them out for now.
- I need to solve the issue of how noticeable texture repeating is from up in the atmosphere and I need to add some noise to the texture generation to make things more interesting.

Wednesday, August 12, 2009

Mipmapping RTT in Ogre3D and Look Up Table Filtering

Two things to write about in this post about the texture generation. First of all, when you create textures via render to texture in Ogre3d you do not get mipmaps auto-generated for you. Strangely enough, DirectX still seemed to do some kind of mipmapping, but OpenGL did not. To implement it in Ogre3d you have to do a render to texture for each mipmap level you want to generate. Here's how I did it...


for(unsigned int i = 0; i <= numMipMaps; ++i) {
    // select mipmap level
    target = mTexture->getBuffer(0,i)->getRenderTarget();
    target->setAutoUpdated(false);

    if(target->getNumViewports()) {
        // get an existing viewport
        mViewport = target->getViewport(0);
    }
    else {
        // add a new viewport
        mViewport = target->addViewport(mRTTCam);
        mViewport->setClearEveryFrame(false);
        mViewport->setOverlaysEnabled(false);
    }
    // render the quad
    mSceneMgr->manualRender(
            &mRenderOp, 
            mPass, 
            mViewport, 
            Matrix4::IDENTITY, 
            Matrix4::IDENTITY, 
            Matrix4::IDENTITY, 
            mBeginEnd //issue a _begin and _end render call
    );
}

The results were decent - not as good as hardware generated mipmaps, but better than no mipmaps and yes, it does affect the speed. Rendering 2 mipmaps isn't so bad a hit but rendering 6+ becomes noticeable - sorry I don't have any hard numbers.
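
To get a feel for what each pass of that loop renders into, here is a small sketch that lists the edge length of every mip level for a square texture. mipLevelSizes is a hypothetical helper of mine, not an Ogre call; it only assumes the usual rule that each level is half the size of the previous one, never smaller than 1 texel.

```cpp
#include <algorithm>
#include <vector>

// Edge length in texels of each mip level of a square texture.
// Level 0 is the full-size texture, so numMipMaps + 1 sizes are returned,
// matching the "i <= numMipMaps" loop in the render code.
std::vector<unsigned> mipLevelSizes(unsigned baseSize, unsigned numMipMaps) {
    std::vector<unsigned> sizes;
    unsigned size = std::max(1u, baseSize);
    for (unsigned i = 0; i <= numMipMaps; ++i) {
        sizes.push_back(size);
        size = std::max(1u, size / 2);  // halve, but never drop below 1 texel
    }
    return sizes;
}
```

For a 128x128 patch texture with 2 mipmaps that is three render passes, at 128, 64 and 32 texels per side.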

Secondly, I was trying to combat texture banding due to the low resolution of the height maps, slope maps, and look up table. I use filtering on the height maps and slope maps so those are blended decently, but the look up table couldn't be filtered because the look up table is really 4 look up tables (16 blend weights, one per RGBA channel) combined into one texture. When I introduced filtering on the look up table the banding would be lessened but I'd get artifacts where there was bleed between look up tables. Recall the look up table texture is a 1024x256 texture that is 4 256x256 textures end on end horizontally. So the horizontal axis is all that really matters because vertically there is no bleed (the texture address mode is set to clamp, not wrap). The horizontal axis is the slope value, so what I did was compress the slope value and center it on the middle of each look up table so instead of values being from 0 - 255, now they are from 15 - 239 and I don't have to worry about bleeding artifacts anymore (no pun intended).

Here's a couple before/after images showing how the trilinear filtering on the look up table improves the banding issue:

Without filtering...


And with filtering...


Thoughts:
1. How can I use Multiple Render Targets to render textures faster - group them up so I can render 4 textures with one shader?
2. How can I cheaply introduce noise to the altitude and slope values to make the blending better and more random looking?
3. Why is the terrain slightly blueish and dark?
4. When should I work atmosphere and water back in?

Saturday, August 8, 2009

Planet Terrain Texture Generation Optimization Continues

I put a lot of profiling code in Ogre's SceneManager::manualRender() function and in SceneManager::_setPass() as well as in the GL_RenderSystem::_render() function etc. Turns out my profiling code in GL_RenderSystem seems to have skewed my profile results.

Profiling the code wasn't giving me many optimization ideas so I took another approach. You know how when you have slow code sometimes you just cut parts of it out till you find out what is slow about it - not the best way to optimize but it can give you some insight when you don't fully understand the design and implementation of libraries you are using (Ogre3d in my case). I could see from the profiling results that it would be good to focus on the Render To Texture (RTT) code so I removed all the texture look ups in my RTT shader except one and the timing/stuttering issues improved noticeably. I then added each texture look up back in one by one to see if adding a certain texture brought back the stuttering/slowness.

It seems that some of my terrain textures were 1024x1024 while the rest were 512x512 (I have 16 terrain textures that get blended in the RTT shader). So I converted those to 512x512 and that helped - and it seems that going above ~8 512x512 textures gives unacceptable slowness and more stuttering. For some reason this is only an issue with OpenGL Render System not DirectX Render System which seems to handle 16 blends fine with no stuttering and faster render times. I may look into this more later because I think I need at least 12 terrain textures.

I might try having several large shared height maps and slope maps - one per cube face to start - instead of one small height map and slope map per terrain patch. I just like the simplicity of the current approach, but it is probably going to be too slow.

Also, I've moved up to the latest Ogre release v1.6.3

Lastly, I tried GLIntercept which was very nice and easy to use and can output textures, all the OpenGL calls and lots more. Did I mention it's free?

At some point I need to start saving rendered terrain textures and height maps and normal maps to disk then reading them back from disk instead of generating them every time.

Sunday, July 26, 2009

OpenGL Render To Texture 10x Slower than DirectX

That's my main problem right now. It takes roughly 10 - 20 times longer when I call mSceneManager->manualRender() when using OpenGL in Ogre instead of DirectX (9).

After getting my app running in DirectX, I used NVidia Perfhud to debug the framerate spikes - only to find that DirectX is vastly outperforming OpenGL! I can't really afford gDEBugger ($800), so I'm stuck with inserting my mini-profiler into the GL_RenderSystem code to figure out what parts are slowing down etc. That will be a good experience and help me get more familiar with Ogre RenderSystem calls.

Misc Notes:
- Make sure you play around with your RenderSystem configuration settings. I had set my Display Frequency to 59 and that caused the framerate to be cut in half. Dunno why I had it set to that value.
- With OpenGL you don't need the RenderSystem to issue _begin() and _end() frame calls, but you do in DirectX.
- Fixed a bug where when writing dynamic textures I was using pointer math and it was messing up my textures in DirectX. I switched to using array notation and it fixed the issue. Haven't spent much time figuring out why I was having so much grief with that (it worked fine with the OpenGL RenderSystem).

Next, I plan on profiling the GL_RenderSystem to figure out those RTT issues, and upgrading to the newly released Ogre version 1.6.3!!

Tuesday, July 21, 2009

Optimisation Fail

This is a maintenance/progress blog post so nothing exciting really. I just try to post a few times a month as any developer keeping a journal should.

I got rid of the vertex buffer and texture resource pools and created a mesh pool where each mesh has its own vertex buffer and textures, so the same thing is accomplished and the mesh resource manager is now way simpler. I used to have std::maps with pointers to the meshes in each list depending on what stage the mesh was in (needs building, built, visible, cached and cache build) - now I have 3 std::vectors for 3 lists: needs building, built and visible.

The lists are stored in an array like this:

std::vector mMeshLists[3];

The built and visible lists are in index 0 and 1 and I swap them periodically like this:

mVisibleListIndex = (mBuiltListIndex + 1) % 2;

And the build list is always index 2.
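
Put together, the idea looks roughly like the sketch below. This is not my actual class: Mesh is a stand-in for the terrain patch mesh type, and MeshLists and its method names are hypothetical. It only shows how flipping one index swaps the built and visible lists without copying anything.

```cpp
#include <vector>

struct Mesh;  // hypothetical stand-in for the terrain patch mesh type

class MeshLists {
public:
    // The built and visible lists always occupy indices 0 and 1.
    std::vector<Mesh*>& built() { return mLists[mBuiltIndex]; }
    std::vector<Mesh*>& visible() { return mLists[(mBuiltIndex + 1) % 2]; }

    // The build list is always index 2.
    std::vector<Mesh*>& build() { return mLists[2]; }

    // Exchange the roles of the built and visible lists without copying.
    void swapBuiltAndVisible() { mBuiltIndex = (mBuiltIndex + 1) % 2; }

private:
    std::vector<Mesh*> mLists[3];
    int mBuiltIndex = 0;
};
```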

Unfortunately, the app is still too jerky. Creating the textures on the graphics card is taking too long, generating the heightmaps is taking too long and it's not smooth even though I'm using threads for all the non-opengl stuff. Most of the time spikes are now happening in Ogre so I will probably have to add some profiling there to figure out what's going on.

Also, I switched over to DirectX 9 so that I can use NVidia's Perfhud tool to help me analyze how the graphics card and cpu are being utilized. That's been fun.

The only major bug I have at the moment is that the Render To Texture code that generates the terrain texture for each patch isn't working right in DirectX. My dynamic heightmap texture is not being created correctly and I don't know why yet. I have been experimenting with different pixel formats, so far to no avail.

I'm almost thinking of doing another simpler version of the quad tree planet that uses geomipmapping, with the planet size restricted to 1025x1025 vertices per side - or maybe 2049x2049. This terrain is taking too much of my time to code that should be spent on other important game things!

Friday, June 26, 2009

Debug Mode Really??

Minor amendment to the previous post - I was running in debug mode hence the poor fps, and when I switched to release the framerates jumped up to OSX levels (300 in space and about 80 at surface). I expect the framerates to be lower at the surface with all the texturing going on so I'm OK with it for now.

Also, I've been viewing my profiling stats with Open Office and I have to say their charting tool is faster than Numbers, which I had been using on OSX. Not surprisingly, I'm seeing spikes when the terrain textures are created. To help with this I think I will implement a resource pool for the textures and try any other optimizations I can think of.

Lastly, I fixed the yellow/black stripe texture bug. Ogre displays that yellow/black texture when there is some kind of error with the texture you tell it to display. In my case I was asking Ogre to use a texture that I had destroyed when a mesh was cached and had not recreated when that mesh was reclaimed from the cache. In the future I should make sure to cache everything a mesh needs along with the mesh to optimize the speed of reclaiming from the cache - of course there is a speed/memory usage issue as always.

Monday, June 22, 2009

Diffuse Texture Applied to Planet and Performance Issues

I finally modified the quadtree planet test program to use textures that are generated on the GPU for each mesh. It's a lot more intensive than I had hoped - each mesh uses its own height map, slope map, texture map etc. Basically I took the simplest approach and ran with it and the results showed that texturing is possible, but the performance is now a major issue. On OSX I was getting 300+ FPS near the surface and more when out in space, but in Windows Vista I've been getting 90 FPS in space and as bad as 4 FPS at the surface. Not to mention that updating the meshes, with the added strain of the texture generation, is way too slow.

This screenshot near the surface shows the poor FPS and also how at the surface the textures are too blurred. I can increase them but performance is bad at the surface already.

I haven't started debugging the performance bottlenecks because I just wanted to get it working first, but now I think I've arrived at a point where the performance is so bad that I can't continue without fixing it.

One thing I ran into that is Ogre related is that it is a bad idea to call manualRender() from within the _updateRenderQueue() function (possible recursion issues?). So I took all the mesh building functionality that had to happen in the main thread and moved it out of the _updateRenderQueue() function and into the application update() function.

The only obvious remaining graphical bug I have yet to figure out is the yellow/black striped textures I'm getting when pulling away from the planet when cached meshes are being drawn (I'm probably releasing a texture and not re-creating it).

*sigh* I miss having a good looking planet so bad. Can't wait to get it pretty again.

Thursday, May 28, 2009

CEGUI Ugliness And UML Diagrams

Yes, I looked at the Taharez and Windows skins before settling for the "Vanilla" skin. Lesser of three evils no? Integrating CEGUI (Crazy Eddies Graphical User Interface) was somewhat painless but there are some definite quirks with the code and the layout generator tool that I ran into:

When you deactivate a window in CEGUI your subsequent calls to getActiveChild() will still return the deactivated window.

The layout generator tool undo system does not seem to work and the copy and paste functionality is also finicky.

The use of a data directory for the layout tool is not intuitive to me. It should just use its own program install directory and save the .layout files where you indicate via the save file dialog - isn't that the main purpose of the tool? Why confuse your average n00b by asking them where they want their data directory to be when they install? It's not like you can easily change the data directory if you select the wrong location as I found out.

CEGUI window skins seem to stretch by default so I will eventually have to figure out how to make them pixel perfect and resolution dependent where needed.

The reason I needed a GUI was that setting the 8 different blend properties for the 16 different layers was going to be a pain to do (and slow) by keyboard alone. I also wanted to see how easy it would be to work in what seems to be the "go to" Ogre GUI (it gets used in several demos). The GUI allows me to move the sliders that adjust the values in the look up table (or I can type the values in by using the edit boxes and pressing 'Enter'). When values are adjusted the terrain texture is automatically updated and the screen refreshes to show the new texture. I use the mouse right click to toggle between GUI mode (when the mouse cursor is revealed) and fly mode (when the cursor is hidden and the mouse controls where you look).

After integrating the GUI I realized how complicated this little tool had become and that in order to un-complicate it and make it easier to integrate into the planet engine that I would need to break out the UML tools (I used OmniGraffle Pro) in order to get a good look at the architecture.

As you can (or cannot) see, it's a tad complicated for what looks so basic by the screenshot and I'm not even including all the getter/setter functions. To integrate this with the planet engine I think I will make the render to texture a singleton helper and if I bring the GUI over then I'll probably make a GUI class that receives input from the InputHandler and then translates the GUI actions into application actions. The InputHandler will be responsible for non-GUI actions like movement, screenshots, quitting the program etc.

Sorry this isn't a very exciting post, but it was necessary. I'm going to try and get the quadtree test program running on Vista and then try to integrate the new terrain texture process. See how I said that in one sentence? It will probably take me weeks to get running right.

Oh and I should say I'm excited to see that Steve Streeting (the main dev behind Ogre 3D) appears to be working on some terrain improvements to Ogre which I plan on dissecting. You can follow his tweets at http://twitter.com/sinbad_ogre. Mine are at http://twitter.com/petrocket.

Tuesday, May 12, 2009

Dynamically Generated Heightmap Textures

Everyone has to do several of these in their life I'm sure. I mean, implementing some kind of blend scheme for making a texture for a height field is terrain 101.

The particular method I've been working on is based on Ysaneya's latest method. He generates a texture for each of his terrain patches using a shader that blends up to 16 textures based on a look up table. The inputs for the look up table are slope and height - read more about it at Ysaneya's dev journal post on atmospheric scattering and terrain texturing.

My test implementation here uses Ogre3d as the graphics engine. I'm using the Octree scene manager plugin to do the quick heightmap - but I hijack the textures used by the heightmap and overwrite them with my dynamically generated texture. The process is as follows:

1. create heightmap with octree terrain manager
2. load the heightmap image and create a slope map from it
3. render a full screen quad that has the texture generation shader which combines up to 16 textures based on the terrain height and slope. The height corresponds to the y value and the slope corresponds to the x value in the look up table.
4. write the render to texture into the texture used by the heightfield
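
I haven't shown how the slope map in step 2 gets built, so here is one possible sketch and not necessarily what my code does. It assumes a row-major height map with values in [0, 1], uses differences between neighbouring samples (clamped at the borders) and squashes the result into [0, 1]. makeSlopeMap and the scale tuning factor are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Build a slope map from a row-major height map of size width x height.
// `scale` is a hypothetical tuning factor that controls how quickly a
// height difference saturates to the maximum slope of 1.
std::vector<float> makeSlopeMap(const std::vector<float>& heights,
                                std::size_t width, std::size_t height,
                                float scale) {
    std::vector<float> slopes(heights.size(), 0.0f);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            // Clamp neighbour indices at the borders of the map.
            const std::size_t x0 = (x > 0) ? x - 1 : x;
            const std::size_t x1 = (x + 1 < width) ? x + 1 : x;
            const std::size_t y0 = (y > 0) ? y - 1 : y;
            const std::size_t y1 = (y + 1 < height) ? y + 1 : y;

            const float dx = heights[y * width + x1] - heights[y * width + x0];
            const float dy = heights[y1 * width + x] - heights[y0 * width + x];

            // Steepness is the length of the height gradient.
            const float steepness = std::sqrt(dx * dx + dy * dy) * scale;
            slopes[y * width + x] = std::min(1.0f, steepness);
        }
    }
    return slopes;
}
```

The resulting values can be uploaded as a single channel texture and sampled by the shader the same way as the height map.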

Here's what my look up table looks like:
Look up table image

I'm still experimenting with look up table values and blending between altitudes and slopes. Also, I need to add some random noise to the altitude/slope values to mix things up a bit.

Of course the actual planet textures won't be so blurry, or if I do end up using this resolution I will probably implement some kind of detail map on top of it.

Here's the fragment shader (GLSL) for the texture generation:


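// inputs, with types inferred from how they are used below
uniform sampler2D heightMap;
uniform sampler2D slopeMap;
uniform sampler2D lutTex;
uniform sampler2D diffTex0, diffTex1, diffTex2, diffTex3;
uniform sampler2D diffTex4, diffTex5, diffTex6, diffTex7;
uniform sampler2D diffTex8, diffTex9, diffTex10, diffTex11;
// UV extents of this patch on the cube face: (minU, minV, maxU, maxV)
uniform vec4 uvExtents;

void main()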
{

    float altitude = texture2D(heightMap, gl_TexCoord[0].st).r;
    float slope    = texture2D(slopeMap, gl_TexCoord[0].st).r;

    vec2 uvSize = vec2(uvExtents.z - uvExtents.x, uvExtents.w - uvExtents.y);

    // diffuse UV should be the UV position relative to the face on the cube
    // if we just used gl_TexCoord[0] we would have UV position relative to
    // each quadrant and we'd get seams
    vec2 diffUV = vec2(uvExtents.x + (gl_TexCoord[0].s * uvSize.x), 
                  uvExtents.y + (gl_TexCoord[0].t * uvSize.y));

    // tile the textures so they look better in this test
    diffUV.x *= 10.0;
    diffUV.y *= 10.0;
 
    // get the 16 blending weights for this slope and altitude
    vec4 weights0 = texture2D(lutTex, vec2(slope * 0.25 + 0.00, altitude));
    vec4 weights1 = texture2D(lutTex, vec2(slope * 0.25 + 0.25, altitude));
    vec4 weights2 = texture2D(lutTex, vec2(slope * 0.25 + 0.50, altitude));
    vec4 weights3 = texture2D(lutTex, vec2(slope * 0.25 + 0.75, altitude));

    // use w,x,y,z order because PNG uses pixel format A8R8G8B8
    gl_FragColor = texture2D(diffTex0, diffUV)  * weights0.w +
                   texture2D(diffTex1, diffUV)  * weights0.x +
                   texture2D(diffTex2, diffUV)  * weights0.y +
                   texture2D(diffTex3, diffUV)  * weights0.z +
                   texture2D(diffTex4, diffUV)  * weights1.w +
                   texture2D(diffTex5, diffUV)  * weights1.x +
                   texture2D(diffTex6, diffUV)  * weights1.y +
                   texture2D(diffTex7, diffUV)  * weights1.z +
                   texture2D(diffTex8, diffUV)  * weights2.w +
                   texture2D(diffTex9, diffUV)  * weights2.x +
                   texture2D(diffTex10, diffUV) * weights2.y +
                   texture2D(diffTex11, diffUV) * weights2.z;
}

*edit* I should note that the above shader only combines 12 textures - of which I'm only using 8 for now - you should have no trouble adding the extra lines for diffTex12-diffTex15 and the weights3 vec4.

Sunday, April 12, 2009

Fast Normals .. Or Not

Well, I spoke too soon about the fast normals. They almost work. The normals do not line up exactly at the corners of the cube so there are visible seams. I spent some time trying to come up with some extra rotation to compensate for this distortion but was not successful and have decided to revert to calculating the normals the old way, which is by calculating triangle normals and then smoothing.

I believe to fix the cube to sphere mapping you need to rotate the normals about their vertical axis as they approach the corners, but I have not come up with a good method yet. If anyone has any ideas I'd love to hear them.

(click the above image to see the normal map corner distortion)

Thursday, April 9, 2009

Fast Normal Calculations for Planets

Up till now I've been using a brute force method of normal calculation for the patch normals. I would use the cross product to find the normal for each triangle and then smooth the triangle normals when calculating the per-vertex normals.

I was aware of the simplified normal calculations for flat heightfields but wasn't sure how to translate those to the sphere - until now. Aurelio Reis and I had been corresponding and I emailed him about how I didn't think his method for mapping normals to the sphere was correct (turns out I completely misunderstood his method) and I recommended translating the normal on the cube face to a normal on the sphere by getting the rotation matrix from the normal of the cube face to the vertex position on the unit sphere.

The fast method for deriving a rotation matrix given a start and an end vector can be found here: http://www.cs.brown.edu/~jfh/papers/Moller-EBA-1999/main.htm - Tomas Moller & John F. Hughes, "Efficiently Building a Matrix to Rotate One Vector to Another" - and their sample code is here: http://jgt.akpeters.com/papers/MollerHughes99/code.html

The funny thing is my implementation failed and I had given up, when Aurelio wrote me back with news of his success, and some helpful advice about calculating the heightfield normals. With that encouragement I attacked it again, fixed my bugs and got it working.

Thanks to Aurelio's suggestion, I'm using the Game Programming Gems 3 method for calculating heightfield normals. The tricky thing with this method of calculating normals is that it assumes the distance between the vertices is 1 unit. When the camera approaches the terrain and it splits the quadtree, the new patches are twice the resolution of the old patch, so the vertices are now 0.5 units apart. Thankfully the length of the normal vector doesn't matter in the calculations, which simplifies things: for a vertex at level N of the quadtree the equation is:

Normal = ( h3 - h1, 2.0 / (1 << level), h4 - h2 )

Where h1-h4 are the heights of the neighboring vertices, and level is the depth in the quadtree (starting at depth 0).

Here's what that equation looks like for levels 0, 1, and 2:

Level 0:
Nv = ( h3 - h1, 2, h4 - h2 )

Level 1:
Nv = ( h3 - h1, 1, h4 - h2 )

Level 2:
Nv = ( h3 - h1, 0.5 , h4 - h2 )
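
The same formula as a tiny standalone function, in case it is easier to read that way. This is only a sketch: the Normal struct and function names are hypothetical, and h1-h4 are passed in the same order as in the equation above.

```cpp
#include <cmath>

struct Normal {
    double x, y, z;
};

// Unnormalised heightfield normal at quadtree depth `level` (0 = root).
// The y component halves with every level because the vertex spacing halves.
Normal heightfieldNormal(double h1, double h2, double h3, double h4, unsigned level) {
    return Normal{h3 - h1, 2.0 / static_cast<double>(1u << level), h4 - h2};
}

// Scale a normal to unit length.
Normal normalised(const Normal& n) {
    const double len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    return Normal{n.x / len, n.y / len, n.z / len};
}
```

On perfectly flat terrain the result always normalises to straight up (0, 1, 0), whatever the level.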

I normalise the normals before applying the rotation matrix that rotates them from cube space to sphere space.

Here's the basic code for calculating the normals (hMap is an RGBA image where the RGB values are the XYZ for the vertex position on the unit sphere and the A value is the height in the heightmap):


    // Get the Y factor for the normal calculations
    Real f = 2.0  / ((Real)(1 << getDepth()));
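    Matrix3 m;
    Real h1, h2, h3, h4;        // heights of the four neighbouring vertices
    Vector3 spherePos, normal;  // vertex position on the unit sphere and its normal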
    
    float* hMap = (float*)mHeightMap->getData();

    // skip the outer edge and first vertex
    hMap += (patchSize+2 + 1) << 2;

    // loop through the vertices and save the normals to the normal map
    float* nMap = (float*)mNormalMap->getData(); 
    for(int y = 0; y < patchSize; ++y) {
        for(int x = 0; x < patchSize; ++x) {
            h1 = *(hMap + 7);
            h2 = *(hMap + ((patchSize + 2) * 4) + 3);
            h3 = *(hMap - 4 + 3);
            h4 = *(hMap - ((patchSize + 2) * 4) + 3);
                    
            spherePos.x = hMap[0];
            spherePos.y = hMap[1];
            spherePos.z = hMap[2];
        
            normal = Vector3( (h3 - h1), f, (h4 - h2)).normalisedCopy();
            
            buildRotationMatrix(Vector3::UNIT_Y, spherePos, m);
            normal = m * normal;
            
            *nMap++ = normal.x;
            *nMap++ = normal.y;
            *nMap++ = normal.z;
            
            hMap += 4;           
        }

        // skip edge vertices
        hMap += 8;
    }

Of course I plan to clean up the code when/if I make the loops OpenMP compatible.

Tuesday, March 24, 2009

Resource Managers

I've been looking for a fast, safe, memory efficient implementation of a basic resource manager that does a few things:

Uses handles instead of pointers

Supports reference counting so resources can be shared and freed properly

Doesn't fragment the heap, but instead, allocates memory in contiguous chunks so all the data in a resource "pool" is local

Is templated and simple

Releases resource handles by pushing them onto the back of the list and getting available handles off the front of the list so that the most recently used handles stay in the handles list longest for best cache results.

Here are the ones I considered in my research.

A Simple Fast Resource Manager using C++ and STL (Ashic Mahtab, Zinat Wali)

Pros:
Uses STL and templates, very simple easy to understand, uses reference counting and handles.

Cons:
Uses strings for unique identifiers, pointers to resources are stored in std::vector, not the resources themselves, uses a stack for handles (push and pops off the front).

Gem: A Generic Handle-Based Resource Manager (Scott Bilas)

Pros:
Uses STL and templates, simple to understand, uses handles, stores resources not resource pointers.

Cons:
Uses strings for unique identifiers, does not have reference counting for handles (handle sharing), does not use a queue for handles.

Ogre 3D Resource Manager

Pros:
Is built to work with Ogre resources, supports shared resources and handles (reference counting), supports loading states and threaded loading of resources.

Cons:
Uses strings for unique identifiers, does not use a queue for handles, stores pointers to resources.

All these resource managers are great for managing resources loaded from files, which is why they all use a string as an identifier. None of them were designed with storing on-the-fly generated terrain data in mind.

I ended up starting with Ashic Mahtab and Zinat Wali's manager code and modified it to use a std::deque for the handles, to use an unsigned long long as the unique identifier and to store the resources themselves and not resource pointers.
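
To make the requirements concrete, here is a stripped down sketch of that kind of pool. It is not my modified manager code: ResourcePool and its methods are hypothetical, the pool has a fixed capacity, and the unsigned long long unique identifier lookup is left out. It does show the three points I cared about: resources stored by value in contiguous memory, reference counted handles, and a std::deque so released handles go on the back and new ones come off the front.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Handle-based pool: resources live by value in one contiguous vector and
// handles are plain indices into it.
template <typename T>
class ResourcePool {
public:
    using Handle = std::size_t;

    explicit ResourcePool(std::size_t capacity)
        : mResources(capacity), mRefCounts(capacity, 0) {
        for (Handle h = 0; h < capacity; ++h) {
            mFree.push_back(h);
        }
    }

    // Takes a free handle off the front of the queue.
    // Returns false when the pool is exhausted.
    bool acquire(Handle& out) {
        if (mFree.empty()) {
            return false;
        }
        out = mFree.front();
        mFree.pop_front();
        mRefCounts[out] = 1;
        return true;
    }

    // Share an already acquired handle.
    void addRef(Handle h) { ++mRefCounts[h]; }

    // Drops one reference; the handle goes onto the back of the queue
    // once nobody is using it any more.
    void release(Handle h) {
        if (mRefCounts[h] > 0 && --mRefCounts[h] == 0) {
            mFree.push_back(h);
        }
    }

    T& get(Handle h) { return mResources[h]; }
    std::size_t available() const { return mFree.size(); }

private:
    std::vector<T> mResources;
    std::vector<unsigned> mRefCounts;
    std::deque<Handle> mFree;
};
```

Note that T has to be default constructible and of a size known at compile time for this to work.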

So after I coded that up I ran into some major problems:

The Ogre::VertexData class has its constructor set private so I couldn't subclass it to easily make it a resource. The solution was to store a shared pointer to it instead, which means that I lose my resource memory locality.

My vertices and blending vertices can be of variable size and my implementation was designed to store objects of known sizes (known at compile time). I could specialize the vertices manager class to handle this shortcoming, but I opted to just use a shared pointer instead.

Looks like I ended up basically storing pointers to resources instead of the resources so far!

The profile results show that I have brought the deletion time down from 251 microseconds average to about 5 microseconds average. Overall the application update time and the _updateRenderQueue() function time seem to have decreased by at least half.

So I'm fairly happy with the results; the problem is that I'm still seeing spikes in Ogre::Root::renderOneFrame(). The average time for that function is 4,000 microseconds, but the maximum time is around 40,000 microseconds.

In the future I will look into what is going on inside Ogre that is taking this time, but for now I'm going to continue on with my refactoring so this may be the last resource performance related post for a bit.

Friday, March 6, 2009

Shared Vertex Buffers Cont.

The plot doesn't thicken, it plods on like a man in a desert holding a can of pudding and desperately looking for an opener. Actually, it isn't like that at all, but when it gets late you start to get more "poetic" and silly.

Profile Data On Swivel! The blue line is the time taken by renderOneFrame() (an Ogre3D function). The yellow line is the time the application takes to update the quad tree and the brown is the amount of time within renderOneFrame() that is taken up by my application updating the render queue and creating/updating vertex/index buffers. The vertical axis is time in milliseconds and the horizontal axis is time in seconds. This chart only shows the main thread and does not include the builder threads that create the mesh data including the heightfields and normal maps.

I uploaded some profile data for the test application, which runs in 3 different modes, to Swivel, a free online data/graphing tool that is in "preview" aka beta. I needed to find a better graphing tool than iWork's Numbers, which takes 10+ minutes to make graphs of data with more than a handful of rows and columns. Ugh.

I've created a class that encapsulates all the functionality for managing the vertex buffers for the terrain. The class has 3 managing "modes". It can have all the meshes use a large single vertex buffer, or multiple vertex buffers. When using multiple vertex buffers it can either cache those buffers for later re-use, or immediately delete buffers that are no longer used.

When running in single buffer mode, or multiple shared buffer mode, every mesh that is in the visible list and the visible build list has space in the buffer. Every mesh in the cache list no longer has a guaranteed space in the buffer - however if the cached mesh is still around when we need it later the vertex buffer manager will check to see if that mesh's data is still in the vertex buffer somewhere.
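Here is a rough sketch of that bookkeeping, under the assumption that the shared buffer is divided into fixed-size slots. The class SharedBufferSlots and its method names are hypothetical and no Ogre calls are involved; it only tracks which mesh owns which slot and whether a cached mesh's data has survived.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical slot table for a shared vertex buffer split into equal slots.
class SharedBufferSlots {
public:
    enum State { Empty, Pinned, Cached };

    explicit SharedBufferSlots(std::size_t slotCount)
        : owner_(slotCount, 0ULL), state_(slotCount, Empty) {}

    // Gives a visible mesh a guaranteed slot. If the mesh's data is still in
    // the buffer, that slot is reused and needsUpload is set to false.
    // Returns -1 when every slot is pinned by a visible mesh.
    int acquire(unsigned long long meshId, bool& needsUpload) {
        int emptySlot = -1;
        int cachedSlot = -1;
        for (std::size_t i = 0; i < owner_.size(); ++i) {
            if (state_[i] != Empty && owner_[i] == meshId) {
                state_[i] = Pinned;
                needsUpload = false;
                return static_cast<int>(i);
            }
            if (state_[i] == Empty && emptySlot < 0) emptySlot = static_cast<int>(i);
            if (state_[i] == Cached && cachedSlot < 0) cachedSlot = static_cast<int>(i);
        }
        // Prefer a never-used slot, then overwrite a cached mesh's data.
        const int slot = emptySlot >= 0 ? emptySlot : cachedSlot;
        if (slot < 0) return -1;
        owner_[static_cast<std::size_t>(slot)] = meshId;
        state_[static_cast<std::size_t>(slot)] = Pinned;
        needsUpload = true;
        return slot;
    }

    // The mesh moved to the cache list: its data stays in the buffer for now,
    // but the slot is no longer guaranteed.
    void moveToCache(unsigned long long meshId) {
        for (std::size_t i = 0; i < owner_.size(); ++i) {
            if (state_[i] == Pinned && owner_[i] == meshId) state_[i] = Cached;
        }
    }

private:
    std::vector<unsigned long long> owner_;
    std::vector<State> state_;
};
```

A real version would also have to grow the buffer (or add another one) when acquire() returns -1, which this sketch leaves out.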

The nice thing about sharing the vertex buffers is that it means I'm not deleting them like crazy, and because I create enough space to hold a good number of meshes when the program loads I'm not creating buffers like crazy (or expanding them in the case of the single buffer).

Conclusions:
Now that I have coded these 2/3 methods up, I need to do some further testing because so far the only definitive results are that all 3 versions exhibit random spikes in renderOneFrame() that I think are just related to panning the camera around which updates the frustum culling, and that the FPS drops from 310 to 290 when switching from STATIC to DYNAMIC vertex buffers. I plan on making the camera follow a specific path and output profile data for each method using that path so I can better compare the results. Right now it's just me flying down to the surface and then randomly visiting spots on the surface.

Wednesday, March 4, 2009

Shared Vertex Buffer vs. Multiple Vertex Buffers

I went ahead and implemented a shared vertex buffer for all the terrain patches and did some profiling and found that updating the shared vertex buffer takes 6 times longer than creating a new buffer and filling it. Also, when using a shared vertex buffer the fps drops from about 240 to 200 - I'm guessing this has to do with using a dynamic versus a static_write_only buffer. I tried setting the shared buffer type to HBU_DYNAMIC and HBU_DYNAMIC_WRITE_ONLY and used HBL_NO_OVERWRITE for my lock type when writing vertex data. I had 16 unique index buffers (for the 16 different index orders) and each patch used one of those 16 buffers so those didn't change - I just had to update the vertex buffer every time I wanted to add a new patch.

For example, my program samples the longest time it takes to create a vertex buffer and fill it, and in my tests that averaged about 1 ms. Locking part of an existing buffer (the shared buffer) and updating it averaged about 6 ms.

The funny thing about it all is that when NOT using the shared vertex buffer the program seems to "stutter" more - and I haven't nailed down why it does that, but my current theory is some kind of buffer management overhead in Ogre when I create/delete all these buffers, because I've profiled most of my code. When I use the shared buffer the program seems to run smoother with less stuttering, but the fps drops. The stuttering only occurs when flying over the terrain and it is changing LOD often.

My (final?) idea is going to be to go with creating vertex buffers per patch - however, instead of deleting vertex buffers when a patch gets deleted, I'm going to try and recycle that buffer by giving it to another patch to use. My hope is that using HBL_DISCARD on that existing buffer will be faster than creating a new buffer for the new patch and filling it - and I know I will gain however much time it takes to delete a vertex buffer.
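The recycling idea can be sketched as a simple pool. This is only an illustration: FakeBuffer is a stand-in for a real hardware vertex buffer, BufferPool is a made-up name, and refilling the vector merely plays the role that locking with HBL_DISCARD and rewriting the data would play in Ogre.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for a hardware vertex buffer (assumption: a real version would
// hold an Ogre vertex buffer here instead of a plain float array).
struct FakeBuffer {
    std::vector<float> data;
};

class BufferPool {
public:
    BufferPool() : created_(0) {}

    // Hands out a recycled buffer when one is idle, otherwise creates one.
    std::unique_ptr<FakeBuffer> acquire(std::size_t floatCount) {
        std::unique_ptr<FakeBuffer> buf;
        if (!idle_.empty()) {
            buf = std::move(idle_.back());
            idle_.pop_back();
        } else {
            buf.reset(new FakeBuffer());
            ++created_;
        }
        buf->data.assign(floatCount, 0.0f);  // old contents are thrown away
        return buf;
    }

    // Called when a patch goes away: keep its buffer instead of deleting it.
    void release(std::unique_ptr<FakeBuffer> buf) {
        idle_.push_back(std::move(buf));
    }

    std::size_t created() const { return created_; }
    std::size_t idle() const { return idle_.size(); }

private:
    std::vector<std::unique_ptr<FakeBuffer> > idle_;
    std::size_t created_;
};
```

Since every patch has the same vertex count, any idle buffer fits any new patch, which is what makes this kind of pool cheap to manage.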

Not that the program stuttering is bad - it is MUCH improved since moving the build process to a separate thread, and since throttling the number of buffers created/deleted/filled per frame.

*EDIT*

Just did some quick profiling tests and can confirm that when using the single shared vertex buffer the time taken by _renderOneFrame() is about 6ms on average - 12ms worst case, and most of that time is reflected in the updating of the shared vertex buffer. However, _renderOneFrame() averages 12ms - 80ms worst case when using static buffers, and most of that time is NOT reflected in creating/deleting/filling the vertex buffers. Now if Xcode would display Ogre source code I could better nail down where in Ogre the overhead is coming from, but at least now I can see why the app is indeed smoother and less CPU intensive when using the shared buffer - though fps is lowered a bit.

Tradeoffs *sigh*(tm)

Hopefully my shared multiple vertex buffers idea will take care of both issues.

Saturday, February 21, 2009

Blending Vertex Normals In A Quad Tree

Ogre 3D Vertex Normal Blending from petrocket on Vimeo.

I've finally had some time to work on blending the mesh normals and can post a video - my FIRST video of the app so go easy - and talk about what works about it and where the idea came from.

As it turns out there is/was another excellent planetary LOD program based on geo clipmaps and I read that Lutz Justen (the author, now an engineer at Red 5 Studios!) had blended his normal maps and done some other fancy things like cloud map shadows. I contacted Lutz and he responded with this:

"Vertex normals: I don't think there is *any* way to make vertex normals not pop - UNLESS you change the way geo morphing works. You probably interpolate a vertex to the position between the two higher-level verts, i.e.
x
/ | \
/ \/ \
x---o----x

The upper vertex is interpolated to the average of the lower 2 vertices. In that case, you'll always have lighting pops. If you instead interpolate the upper vertex to either the left or the right vertex, you'd get rid of the pops because the limit of the finer triangles as you interpolate to the coarser triangles actually match the coarser triangles. I hope that makes sense. If you interpolate to the average of the lower vertices, the limit doesn't match the coarser representation. It's the same geometry, but a different triangulation.

If you store per-patch normals as Ysaneya does, there's another way. You can store your normal maps in object space (instead of tangent space) because they are unique anyway. Then you can lerp between a fine-level normal map and the corresponding coarse-level normal map. That'd make your lighting completely smooth. That's what I did in my demo, although the resolution of the normal map was very low - it was the same resolution as the vertices! But nevertheless, the lighting was completely continuous."

And that's what I tried to do, only right now my normal map calculations for the "coarse" map are not entirely correct because there is still a pop when the mesh level changes - although a MUCH smaller pop than before, it is still there as you can see in the video. Maybe sometime I will post a video of the transitions without normal blending so you can see the improvement.

Here's how I implemented the normal map blending in Ogre (cooking directions style!)
Ingredients:
6 quad trees
4 quad tree meshes per node
2 sets of normals per mesh (for blending)
1 blend start distance for each level/depth in the quad tree
1 blend end distance for each level/depth in the quad tree
1 GLSL vertex shader to take these values and do something with them

1. prepare your meshes by creating the 3d vertex positions from your 3d noise function
2. calculate your end normals based on those positions and save them - these are the normals we will blend to.
3. get the outer edge vertices for the "coarser" mesh
4. calculate the normals for each vertex that exists in the "coarser" mesh
5. calculate all the interpolated normals for each vertex that exists in the "finer" mesh but not in the "coarser" mesh.
6. in the vertex shader blend between the "finer" and "coarser" normals based on the camera's distance to the vertex. Each level of the quad tree will have a different distance based on the split distance for that level - the split distance is the distance from the camera to a node at which it must split into 4 sub nodes, one for each quadrant.
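Step 6 boils down to a clamped linear blend. The sketch below shows the math in C++ rather than GLSL so it can be tested on the CPU. It assumes the blend start distance is the farther of the two (coarse normals at the start distance, fine normals at the end distance), and the names Vec3, blendFactor and blendNormal are made up for illustration.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

// 0 at (or beyond) blendStart, 1 at (or inside) blendEnd.
// Assumption: blendStart > blendEnd, i.e. the blend runs as the camera approaches.
float blendFactor(float distance, float blendStart, float blendEnd) {
    const float t = (blendStart - distance) / (blendStart - blendEnd);
    return std::min(1.0f, std::max(0.0f, t));
}

// Linear blend from the coarse normal to the fine normal, renormalized.
Vec3 blendNormal(const Vec3& coarse, const Vec3& fine, float t) {
    Vec3 n = { coarse.x + (fine.x - coarse.x) * t,
               coarse.y + (fine.y - coarse.y) * t,
               coarse.z + (fine.z - coarse.z) * t };
    const float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    if (len > 0.0f) {
        n.x /= len;
        n.y /= len;
        n.z /= len;
    }
    return n;
}
```

With a factor of 0 the finer mesh is lit exactly like the coarser one, so any remaining pop at the transition has to come from the "coarse" normals themselves not matching the parent mesh.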

Right now my calculations are not completely correct for step 5 because there is still a visible transition or "pop" when the finer mesh is displayed. I don't know where my calculations are wrong, but if I figure it out I will post the solution.

Saturday, February 14, 2009

Multithreaded Graphics Programming Caveats

You can't make graphics calls in any thread but the main thread without sacrificing platform portability and graphics library portability. In my case I want to be able to run this application on Windows in DirectX 9+ and in OSX and Linux in OpenGL. OpenGL allows you to make graphics calls in a secondary thread if you copy and initialize the graphics context in that other thread, but even if I did that I would have to write separate procedures to initialize the secondary threads for OpenGL and DirectX and none of it would be pretty and from what I've read the gains wouldn't be that significant.

So my compromise is to use POSIX threads and OpenMP for the non-graphic call related stuff - like the heavy lifting math in the terrain generation and normal maps etc. All the graphics related calls will live in the main thread (creating vertex/index buffers and populating them with data, and releasing vertex/index buffers).

I've noticed that creating an Ogre VBO and populating it with data can take up to 700 microseconds (but averages around 165 microseconds), while deleting an Ogre VBO takes roughly 10 times as long in the worst case (max 8 milliseconds, avg 1 millisecond). By only creating or deleting one VBO per frame, and after moving all the vertex calculations to separate threads, I'm able to keep things pretty smooth. Did I mention I'm using my MacBook Pro? It has an Intel Core 2 Duo 2.4 GHz with 4GB of RAM and the NVidia GeForce 8600M GT with 256MB VRAM.
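The "one VBO operation per frame" rule is easy to sketch as a queue that the main thread drains a little at a time. FrameThrottle is a hypothetical name and the queued operations are plain callbacks here; in the real app they would be the buffer create/delete calls.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical per-frame throttle for expensive main-thread operations.
class FrameThrottle {
public:
    explicit FrameThrottle(std::size_t maxPerFrame) : maxPerFrame_(maxPerFrame) {}

    // Queue a buffer create/delete (or any other costly call) for later.
    void enqueue(std::function<void()> op) { pending_.push_back(std::move(op)); }

    // Call once per frame from the main thread; returns how many ops ran.
    std::size_t runFrame() {
        std::size_t ran = 0;
        while (ran < maxPerFrame_ && !pending_.empty()) {
            std::function<void()> op = std::move(pending_.front());
            pending_.pop_front();
            op();
            ++ran;
        }
        return ran;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t maxPerFrame_;
    std::deque<std::function<void()> > pending_;
};
```

The cost of throttling is latency: with a limit of one, ten queued deletions take ten frames to finish, which is fine for cleanup but not for meshes that must be on screen right now.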

At some point I will need to identify why freeing the buffer is taking so much more time, and try to optimize that.

Here's the biggest caveat with threads that I've run into: I cannot reliably allocate memory inside a child thread. Ogre uses nedmalloc now and it is a fantastic allocator/deallocator. Unfortunately, when I try to allocate memory from within a thread other than the main thread I get an EXC_BAD_ACCESS signal from within nedalloc::GetThreadCache(). The annoying thing is that it doesn't happen right away - it's only after the program generates enough meshes. So for now I'm steering clear of any allocation/deallocation inside threads and keeping that in the main thread. It doesn't look like keeping it in the main thread is causing any performance hit anyways.

Thursday, February 12, 2009

POSIX Threads And OpenMP Can Co-Exist

Not exactly a major revelation, but as an experiment I changed the mesh build process to a threaded model using POSIX threads (pthreads). The short version is that when the quad tree changes, first a manager thread is spawned, which spawns several worker threads. The worker threads build the actual mesh and then quit when there are no more meshes to build. The manager thread just waits for all the worker threads to finish then updates the render queue with the new list of meshes and quits.

One important thing to mention, besides all the mutexes to protect the lists, is that the quad tree is not allowed to be updated while the new meshes are being built. Changing the quad tree while still building would invalidate our list of meshes to build and could get us into a situation where the render queue is never updated because the list of meshes to build is never completed!

After the meshes are all built and the new visible list of meshes is given to the render queue then the quad tree is allowed to update, and when the tree changes the whole process begins again.

Also note that while the worker threads are building the new mesh, the render queue continues to display the meshes from the previous version of the quad tree.

The second part of the experiment was to use OpenMP to utilize the extra core of my processor in building the meshes using work sharing. Fortunately, Xcode supports GCC 4.2, which supports OpenMP 2.5, and that will suffice for now. I only added parallel processing to the for loop that generates the vertices, so I don't expect to see much improvement; however, when I start generating normal maps I think the improvement will be noticeable. Anyways, it compiles and runs.
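For reference, work sharing on a vertex loop looks roughly like this. It is a sketch, not the actual mesh builder: fakeHeight is a made-up stand-in for the 3D noise function and buildHeights is a hypothetical name. The loop index is a signed int because OpenMP 2.5 requires that for a parallel for.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the terrain noise function.
float fakeHeight(float x, float y) {
    return 0.5f * std::sin(x) * std::cos(y);
}

// Fills a patchSize x patchSize grid of heights, one row per loop iteration.
std::vector<float> buildHeights(int patchSize) {
    std::vector<float> heights(static_cast<std::size_t>(patchSize) *
                               static_cast<std::size_t>(patchSize));
    // Rows are independent and each iteration writes to its own elements,
    // so the iterations can be shared among threads without locking.
    // If OpenMP is not enabled (e.g. no -fopenmp on GCC) the pragma is
    // ignored and the loop simply runs serially.
    #pragma omp parallel for
    for (int row = 0; row < patchSize; ++row) {
        for (int col = 0; col < patchSize; ++col) {
            heights[static_cast<std::size_t>(row * patchSize + col)] =
                fakeHeight(row * 0.1f, col * 0.1f);
        }
    }
    return heights;
}
```

The vector is sized before the parallel region, so no allocation happens inside the worker threads.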

Next I have to fix some logic bugs, re-implement the code that deletes old meshes in the cache list, and either re-implement the heightfield and normal generating, or take a crack at a second quad tree that is updated based on where the camera will be so those meshes can be pre-built and put in the cache.

A small aside - making the build process threaded made my simple profiler not completely reliable anymore because the thread may start sampling during one frame and then before it is done sampling, the profiler might attempt to process the samples because that part of the profiler runs in the main thread! Fortunately that is unlikely to occur and if it does it shouldn't skew the results much, though I suppose I could add a semaphore to resolve the issue.

Tuesday, February 3, 2009

Quadtree Mesh Management And Caching

Sometimes you can get away with half decent code when dealing with 2d terrain, but you will be punished if you try that with spherical/planet terrain.

Now that my feet are wet I've returned to the basics in order to deal with some of the core issues, like implementing a mesh manager to handle the meshes that should be visible, those that need to be built, and those that are built and should be cached. I also created a customizable QuadTreeSphere class that I plan on using for the water mesh, the atmosphere and the planet surface. I can now specify the min and max node levels allowed in the quadtree, the patch size for each mesh, the max cache size for each mesh and things like whether the mesh will use normals or textures or tangents or even a heightmap.

The new system doesn't do anything but basic terrain now because I've been focusing on the mesh management and caching.

Mesh Management

I've implemented a separate manager for each QuadTreeSphere that has four std::map objects that hold pointers to the mesh objects and the key for each element in the map is the unique ID for that mesh.
1. A visible map that holds all the current visible meshes.
2. A visible build map that holds all the unbuilt meshes that are needed for immediate display.
3. A cache map that holds all the non-visible, but built meshes that we might need soon.
4. A cache build map that holds all the unbuilt meshes that we might need soon.

I also have a FIFO (first in, first out) list that contains the IDs of all the meshes in the cache map. If the FIFO list ever gets larger than the max cache size then I start removing meshes from the list - and from the cache - and deleting them. This list is used to delete the least recently used meshes. Of course if any mesh in the cache is moved to the visible map then it must also be removed from the FIFO list, and when that mesh is moved from the visible map back to the cache map it will go on the front of the list.
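A minimal sketch of the cache map plus its eviction list is below. It simplifies a few things: meshes are stored by value rather than by pointer, eviction is assumed to happen from the back of the list (the opposite end from where meshes are added), and the names Mesh, MeshCache, moveToCache and takeForDisplay are all made up.

```cpp
#include <cstddef>
#include <list>
#include <map>

struct Mesh {
    unsigned long long id;
};

class MeshCache {
public:
    explicit MeshCache(std::size_t maxSize) : maxSize_(maxSize) {}

    // A mesh leaving the visible map goes on the front of the list; if the
    // cache is now too big, the oldest entries are dropped from the back.
    void moveToCache(const Mesh& mesh) {
        if (cache_.count(mesh.id) != 0) return;  // already cached
        cache_[mesh.id] = mesh;
        order_.push_front(mesh.id);
        while (order_.size() > maxSize_) {
            cache_.erase(order_.back());
            order_.pop_back();
        }
    }

    // A cached mesh that becomes visible again leaves both the map and the list.
    bool takeForDisplay(unsigned long long id, Mesh& out) {
        auto it = cache_.find(id);
        if (it == cache_.end()) return false;
        out = it->second;
        cache_.erase(it);
        order_.remove(id);
        return true;
    }

    bool contains(unsigned long long id) const { return cache_.count(id) != 0; }
    std::size_t size() const { return cache_.size(); }

private:
    std::size_t maxSize_;
    std::map<unsigned long long, Mesh> cache_;
    std::list<unsigned long long> order_;  // front = most recently cached
};
```

order_.remove(id) walks the whole list, which is fine for a small cache; a larger one could store list iterators in the map instead.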

Some other rules I have found that I needed to implement:
1. All meshes in the visible build map must be built immediately or you get flickering where you have not built a mesh (duh)
2. Only build one mesh from the cache build map per frame and only if no meshes in the visible build map were built because building meshes takes a lot of time (relatively).
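Those two rules fit in a few lines. In this sketch the two build maps are reduced to queues of mesh IDs and buildMesh is whatever callback actually builds a mesh; BuildQueues and buildForFrame are hypothetical names.

```cpp
#include <cstddef>
#include <deque>

struct BuildQueues {
    std::deque<unsigned long long> visibleBuild;  // needed on screen right now
    std::deque<unsigned long long> cacheBuild;    // might be needed soon
};

// Runs one frame's worth of building and returns how many meshes were built.
template <typename BuildFn>
std::size_t buildForFrame(BuildQueues& q, BuildFn buildMesh) {
    std::size_t built = 0;
    // Rule 1: everything that must be visible is built immediately.
    while (!q.visibleBuild.empty()) {
        buildMesh(q.visibleBuild.front());
        q.visibleBuild.pop_front();
        ++built;
    }
    // Rule 2: at most one speculative mesh, and only on an otherwise idle frame.
    if (built == 0 && !q.cacheBuild.empty()) {
        buildMesh(q.cacheBuild.front());
        q.cacheBuild.pop_front();
        built = 1;
    }
    return built;
}
```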

I added a simple profiling class because I know I will need to compare build optimizations and also I needed to get a good idea for how much time each major component takes. Clicking on the chart below takes you to the larger version.

Quadtree Pre-Caching

My simple manager has a bottleneck right now: when the camera gets close to the surface, all these new meshes are added to the visible build list and must be built immediately for display. There are a couple of things you can do to resolve this issue and the simplest one is to do pre-caching. My simple method is to take the closest quadrant from the closest node at the current deepest level of the tree and pre-cache the four meshes for that quadrant (which we will have to split anyways if the camera gets closer to the surface). The second simple thing to do is to pre-cache the meshes for the neighbors of the closest node at the current deepest level of the quadtree (if those neighbor nodes don't already exist).

Below is a diagram of this:

The pink square represents the closest node to the camera at the current deepest level in the quadtree. The four orange squares are the neighbors of this node.
The purple square represents the quadrant that is closest to the camera in the pink square.

Under my simple caching scheme we know that if the camera moves closer to the surface we must split the purple quadrant - so the first thing to do is to build those meshes and save them in the cache until they're needed. If the camera moves left or right along the surface but not closer to the surface then we'll need the meshes represented by the orange squares, so build those and add them to the cache.
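Picking the pre-cache candidates can be sketched with plain grid coordinates. This assumes each node on a cube face is addressed by a level and an (x, y) position in a 2^level by 2^level grid, and that the camera has been projected into the node's local [0,1) square as (u, v). NodeCoord, closestQuadrant and edgeNeighbors are made-up names, and neighbors across a cube face edge are skipped for brevity.

```cpp
#include <vector>

struct NodeCoord {
    int level;
    int x;
    int y;
};

// The child node covering the quadrant nearest the camera (the "purple" one).
NodeCoord closestQuadrant(const NodeCoord& node, float u, float v) {
    NodeCoord child = { node.level + 1,
                        node.x * 2 + (u >= 0.5f ? 1 : 0),
                        node.y * 2 + (v >= 0.5f ? 1 : 0) };
    return child;
}

// The up-to-four edge neighbors on the same level and face (the "orange" ones).
std::vector<NodeCoord> edgeNeighbors(const NodeCoord& node) {
    const int size = 1 << node.level;
    const int dx[4] = { -1, 1, 0, 0 };
    const int dy[4] = { 0, 0, -1, 1 };
    std::vector<NodeCoord> out;
    for (int i = 0; i < 4; ++i) {
        const int nx = node.x + dx[i];
        const int ny = node.y + dy[i];
        if (nx >= 0 && nx < size && ny >= 0 && ny < size) {
            NodeCoord n = { node.level, nx, ny };
            out.push_back(n);
        }
    }
    return out;
}
```

Because only the four edge offsets are listed, diagonal nodes never show up as candidates.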

If the camera moves further away from the surface the larger meshes should still be in the cache because every mesh that was visible is added to the cache after it is not needed.

Pros of this scheme:
1. Simple and works without modifying the quadtree
2. This caching scheme works fine when the camera is tracking a player running on the surface, where they can change direction and speed rapidly and frequently.

Cons of this scheme:
1. If the camera moves in a diagonal fashion along the surface, the diagonal nodes will not have their meshes in the cache (green nodes)
2. This scheme is ignorant of the camera's current speed and direction, which is more useful information when dealing with fast moving objects above the surface, where the direction and speed won't change as rapidly.

I plan on implementing another cache scheme based on speed and direction where I have a second quadtree that represents where the camera will be in the next nth of a second, and based on that tree, cache all the meshes that would be visible. According to my profiling statistics, the updating of the quadtree doesn't take nearly as much time as building meshes, so it might be worth the extra time to implement this pre-cache scheme and get better cache results.
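The prediction step itself would be simple dead reckoning, something like the sketch below. Point3 and predictPosition are hypothetical names, and it assumes constant velocity over the look-ahead interval. The second quadtree would then be updated from the predicted position instead of the current one.

```cpp
struct Point3 {
    float x, y, z;
};

// Where the camera will be after lookAheadSeconds, assuming constant velocity.
Point3 predictPosition(const Point3& position, const Point3& velocity,
                       float lookAheadSeconds) {
    Point3 p = { position.x + velocity.x * lookAheadSeconds,
                 position.y + velocity.y * lookAheadSeconds,
                 position.z + velocity.z * lookAheadSeconds };
    return p;
}
```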

Lastly, I implemented frustum culling for my barebones quadtree system and it helps the FPS a lot! It cut down the number of triangles from about 200k to about 90k.

I plan on doing some experiments with making the mesh manager threaded to better utilize the cpu.

Saturday, January 10, 2009

Atmosphere Debut

Oh atmosphere with your light scattering and all your fancy equations how you kill me. I pored over a lot of white papers and Sean O'Neil's atmosphere shaders, but I just couldn't get them to work right. I ended up using a simplified version of his sky shader, and came up with my own shader for the ground. The effect isn't ultra-realistic, but that isn't what I'm going for. I would like the shader to be flexible, but for now it gets the idea across that this is a fantasy-esque environment and that you are meant to play.

I added starlight to the dark side of the planet so players will still be able to accomplish things at "night". Still might need to add more moonlight though or make night time artificially short by speeding up the planet rotation or something.

I did make the planet rotate and the light rotate which is nice and now it looks like I have a glorified screensaver. Huzzah.

On the todo list of things to tweak next: tweak the terrain look up table and the textures, add some noise to the table but try to preserve a fantasy/cartoony look, add a sun glow in the sky so I can tell where it is, and make the Mie scattering line up with that sun glow. I'd also like to make the water not look so deep by making it aware of the height of the terrain under it, but I might update the water mesh before I get to that - right now it is just a simple sphere primitive with no LOD and the UV is very squished at the poles.

More random pics:

I need to ask Mr. O'Neil if it is OK for me to share the modified shaders, but I imagine he'll say yes. In other news be sure to check out Ysaneya's latest dev journal post to see what a real pro can do! I just noticed he's pushing 500k tri's with more complex shaders than me and a MUCH larger planet, and I'm only doing 80k tri's with less complex shaders and a small planet and his framerates are higher *sigh*