Update: 18th of June 2012

After a long break I decided to update the project, mostly to learn some new things and try out new ideas. There are three important changes. First, the renderer can now use not only CUDA but also DirectCompute to speed up the computations. Second, the CUDA-based renderer no longer copies the final image to the CPU; the final rasterized picture is rendered directly from GPU memory using OpenGL interoperability (PBO and texture buffer objects). The third major change is the way the GPU-based rasterization is done.

The previous version of Vainmoinen sliced each triangle into tiles of 16x16 pixels. Two variants were available. The first processed each triangle with a single CUDA kernel call. The second could batch more triangles using a sort of "virtual tiling". This way up to 65535 triangles' tiles could be processed with a single kernel call. The main problem with this approach turned out to be synchronization. Since triangles can overlap in screen space, so can the tiles they are made of. This caused occasional flickering of pixels due to race conditions.
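To make the overlap problem concrete, here is a minimal CPU-side sketch in C++. It is not the renderer's code: the `Box`, `Tile`, `sliceIntoTiles` and `writersOfPixel` names are hypothetical, and it assumes that each triangle's tiles are laid out from the corner of its screen-space bounding box. It only counts how many triangle-owned tiles cover a given pixel; whenever that count exceeds one, several tiles processed in the same kernel call could write that pixel without any ordering between them.

```cpp
#include <vector>

constexpr int kTile = 16;  // tile edge in pixels, as in the text

// Inclusive screen-space pixel bounds of one triangle (hypothetical type).
struct Box { int minX, minY, maxX, maxY; };

// A 16x16 tile that belongs to exactly one triangle (hypothetical type).
struct Tile { int triangle; int x0, y0; };

// Slice every triangle's bounding box into 16x16 tiles.
// Assumption: tiles start at the bounding box's top-left corner.
std::vector<Tile> sliceIntoTiles(const std::vector<Box>& boxes) {
    std::vector<Tile> tiles;
    for (int t = 0; t < static_cast<int>(boxes.size()); ++t)
        for (int y = boxes[t].minY; y <= boxes[t].maxY; y += kTile)
            for (int x = boxes[t].minX; x <= boxes[t].maxX; x += kTile)
                tiles.push_back({t, x, y});
    return tiles;
}

// Number of tiles whose 16x16 area contains pixel (px, py).
// More than one means several tiles may write this pixel concurrently.
int writersOfPixel(const std::vector<Tile>& tiles, int px, int py) {
    int writers = 0;
    for (const Tile& t : tiles)
        if (px >= t.x0 && px < t.x0 + kTile &&
            py >= t.y0 && py < t.y0 + kTile)
            ++writers;
    return writers;
}
```

For two triangles with bounding boxes (0,0)-(31,31) and (8,8)-(23,23), pixel (10,10) falls inside one tile of each triangle, so two independent tiles compete for it.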

The new hardware rasterization algorithm also uses a tiling approach, but now the screen is tiled instead of the triangles. Each tile is assigned to a unique 16x16-pixel portion of the screen, so two tiles never write to the same pixel. The renderer can schedule up to a fixed number of triangles to be processed with a single kernel call. The data of the triangles to be rendered in a particular kernel call is stored in GPU memory (and is submitted there by the CPU every frame). Moreover, for each screen tile there is a list of indices of the triangles that affect that tile. The algorithm simply iterates over these triangles and rasterizes their pixels.
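The screen-tiling scheme can be sketched on the CPU as follows. This is an illustration under stated assumptions, not Vainmoinen's actual kernel: all names (`TileGrid`, `binTriangles`, `rasterizeTile`, `render`) are hypothetical, triangles are binned conservatively by bounding box, coverage is tested with edge functions at pixel centers, and depth testing and attribute interpolation are left out. On the GPU, the per-tile loop would correspond to one thread block per tile rather than a sequential loop.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kTileSize = 16;  // tile edge in pixels, as in the text

struct Vec2 { float x, y; };
struct Triangle { Vec2 v[3]; std::uint32_t color; };

struct TileGrid {
    int width, height;                   // screen size in pixels
    int tilesX, tilesY;                  // tile counts, rounded up
    std::vector<std::vector<int>> bins;  // per-tile lists of triangle indices

    TileGrid(int w, int h)
        : width(w), height(h),
          tilesX((w + kTileSize - 1) / kTileSize),
          tilesY((h + kTileSize - 1) / kTileSize),
          bins(static_cast<std::size_t>(tilesX) * tilesY) {}
};

// Edge function: its sign tells on which side of the line a->b point p lies.
static float edge(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Append each triangle's index to every tile its bounding box touches.
void binTriangles(const std::vector<Triangle>& tris, TileGrid& grid) {
    for (auto& bin : grid.bins) bin.clear();
    for (int i = 0; i < static_cast<int>(tris.size()); ++i) {
        const Vec2* v = tris[i].v;
        const float minX = std::min({v[0].x, v[1].x, v[2].x});
        const float maxX = std::max({v[0].x, v[1].x, v[2].x});
        const float minY = std::min({v[0].y, v[1].y, v[2].y});
        const float maxY = std::max({v[0].y, v[1].y, v[2].y});
        const int tx0 = std::max(0, static_cast<int>(std::floor(minX / kTileSize)));
        const int ty0 = std::max(0, static_cast<int>(std::floor(minY / kTileSize)));
        const int tx1 = std::min(grid.tilesX - 1, static_cast<int>(std::floor(maxX / kTileSize)));
        const int ty1 = std::min(grid.tilesY - 1, static_cast<int>(std::floor(maxY / kTileSize)));
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                grid.bins[ty * grid.tilesX + tx].push_back(i);
    }
}

// Rasterize one screen tile: every pixel walks the tile's triangle list in
// submission order, so later triangles overwrite earlier ones.
void rasterizeTile(const std::vector<Triangle>& tris, const TileGrid& grid,
                   int tx, int ty, std::vector<std::uint32_t>& frame) {
    const std::vector<int>& bin = grid.bins[ty * grid.tilesX + tx];
    const int x1 = std::min(grid.width, (tx + 1) * kTileSize);
    const int y1 = std::min(grid.height, (ty + 1) * kTileSize);
    for (int y = ty * kTileSize; y < y1; ++y) {
        for (int x = tx * kTileSize; x < x1; ++x) {
            const Vec2 p{x + 0.5f, y + 0.5f};  // sample at the pixel center
            for (int idx : bin) {
                const Triangle& t = tris[idx];
                const float e0 = edge(t.v[0], t.v[1], p);
                const float e1 = edge(t.v[1], t.v[2], p);
                const float e2 = edge(t.v[2], t.v[0], p);
                // Accept both windings: all edge values share one sign.
                const bool inside = (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
                                    (e0 <= 0 && e1 <= 0 && e2 <= 0);
                if (inside) frame[y * grid.width + x] = t.color;
            }
        }
    }
}

// Bin all triangles, then rasterize the screen tile by tile.
std::vector<std::uint32_t> render(const std::vector<Triangle>& tris, int w, int h) {
    TileGrid grid(w, h);
    binTriangles(tris, grid);
    std::vector<std::uint32_t> frame(static_cast<std::size_t>(w) * h, 0u);
    for (int ty = 0; ty < grid.tilesY; ++ty)
        for (int tx = 0; tx < grid.tilesX; ++tx)
            rasterizeTile(tris, grid, tx, ty, frame);
    return frame;
}
```

Because a pixel is only ever written by the tile that owns it, and that tile visits its triangles in a fixed order, the result does not depend on how the tiles are scheduled. The price is the per-tile index lists, which have to be rebuilt and uploaded whenever the triangles change.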

Performance

One test was conducted. Three full-screen quads were rendered on top of one another (sorted back-to-front, so no early-z). The screen resolution was 1366x768. The GPU used was a GeForce GT540M. Performance stats:

OpenGL - 1.163ms

Vainmoinen (CUDA):
clear buffers (kernel) - 0.412ms
triangles buffer (copy) - 0.001ms
tiles' indices to triangles buffer (copy) - 1.500ms
rasterize (kernel) - 8.653ms

OpenGL's timing includes the whole pipeline. However, since there were only 12 vertices to process, the time needed for vertex processing is negligible.
The GL_ARB_timer_query extension was used to measure OpenGL's time. NVIDIA's Compute Visual Profiler was used to get CUDA's timings.


1st of February 2011

Vainmoinen is a project that was developed for my bachelor thesis "Software Renderer Accelerated by CUDA Technology". As the thesis title suggests, Vainmoinen is a software renderer that can use CUDA technology to speed up the rendering. The renderer supports a very small, selected subset of OpenGL/Direct3D functionality.