Saturday, July 30, 2016

dirty tricks for moar fps!

This weekend I landed a patchset in mesa to add support for resource shadowing and batch re-ordering in freedreno.  What this is, will take a bit of explaining, but the tl;dr: is a nice fps boost in many games/apps.

But first, a bit of background about tiling gpu's:  the basic idea of a tiler is to render N draw calls a tile at a time, with a tile's worth of the "framebuffer state" (ie. each of the MRT color bufs + depth/stencil) resident in an internal tile buffer.  The idea is that most of your memory traffic is to/from your color and z/s buffers.  So rather than rendering each of your draw calls in it's entirety, you split the screen up into tiles and repeat each of the N draws for each tile to fast internal/on-chip memory.  This avoids going back to main memory for each of the color and z/s buffer accesses, and enables a tiler to do more with less memory bandwidth.  But it means there is never a single point in the sequence of draws.. ie. draw #1 for tile #2 could happen after draw #2 for tile #1.  (Also, that is why GL_TIMESTAMP queries are bonkers for tilers.)

For purpose of discussion (and also how things are named in the code, if you look), I will define a tile-pass, ie. rendering of N draws for each tile in succession (or even if multiple tiles are rendered in parallel) as a "batch".

Unfortunately, many games/apps are not written with tilers in mind.  There are a handful of common anti-patterns which force a driver for a tiling gpu to flush the current batch.  Examples are unnecessary FBO switches, and texture or UBO uploads mid-batch.

For example, with a 1920x1080 r8g8b8a8 render target, with z24s8 depth/stencil buffer, an unnecessary batch flush costs you 16MB of write memory bandwidth, plus another 16MB of read when we later need to pull the data back into the tile buffer.  That number can easily get much bigger with games using float16 or float32 (rather than 8 bits per component) intermediate buffers, and/or multiple render targets.  Ie. two MRT's with float16 internal-format plus z24s8 z/s would be 40MB write + 40MB read per extra flush.

So, take the example of a UBO update, at a point where you are not otherwise needing to flush the batch (ie. swapbuffers or FBO switch).  A straightforward gl driver for a tiler would need to flush the current batch, so each of the draws before the UBO update would see the old state, and each of the draws after the UBO update would see the new state.

Enter resource shadowing and batch reordering.  Two reasonably big (ie. touches a lot of the code) changes in the driver which combine to avoid these extra batch flushes, as much as possible.

Resource shadowing is allocating a new backing GEM buffer object (BO) for the resource (texture/UBO/VBO/etc), and if necessary copying parts of the BO contents to the new buffer (back-blit).

So for the example of the UBO update, rather than taking the 16MB+16MB (or more) hit of a tile flush, why not just create two versions of the UBO.  It might involve copying a few KB's of UBO (ie. whatever was not overwritten by the game), but that is a lot less than 32MB?

But of course, it is not that simple.  Was the buffer or texture level mapped with GL_MAP_INVALIDATE_BUFFER_BIT or GL_MAP_INVALIDATE_RANGE_BIT?  (Or GL API that implies the equivalent, although fortunately as a gallium driver we don't have to care so much about all the various different GL paths that amount to the same thing for the hw.)  For a texture with mipmap levels, we unfortunately don't know at the time where we need to create the new shadow BO whether the next GL calls will glGenerateMipmap() or upload the remaining mipmap levels.  So there is a bit of complexity in handling all the cases properly.  There may be a few more cases we could handle without falling back to flushing the current batch, but for now we handle all the common cases.

The batch re-ordering component of this allows any potential back-blits from the shadow'd BO to the new BO (when resource shadowing kicks in), to be split out into a separate batch.  The resource/dependency tracking between batches and resources (ie. if various batches need to read from a given resource, we need to know that so they can be executed before something writes to the resource) lets us know which order to flush various in-flight batches to achieve correct results.  Note that this is partly because we use util_blitter, which turns any internally generated resource-shadowing back-blits into normal draw calls (since we don't have a dedicated blit pipe).. but this approach also handles the unnecessary FBO switch case for free.

Unfortunately, the batch re-ordering required a bit of an overhaul about how cmdstream buffers are handled, which required changes in all layers of the stack (mesa + libdrm + kernel).  The kernel changes are in drm-next for 4.8 and libdrm parts are in the latest libdrm release.  And while things will continue to work with a new userspace and old kernel, all these new optimizations will be disabled.

(And, while there is a growing number of snapdragon/adreno SBC's and phones/tablets getting upstream attention, if you are stuck on a downstream 3.10 kernel, look here.)

And for now, even with a new enough kernel, for the time being reorder support is not enabled by default.  There are a couple more piglit tests remaining to investigate, but I'll probably flip it to be enabled by default (if you have a new enough kernel) before the next mesa release branch.  Until then, use FD_MESA_DEBUG=reorder (and once the default is switched, that would be FD_MESA_DEBUG=noreorder to disable).

I'll cover the implementation and tricks to keep the CPU overhead of all this extra bookkeeping small later (probably at XDC2016), since this post is already getting rather long.  But the juicy bits: ~30% gain in supertuxkart (new render engine) and ~20% gain in manhattan are the big winners.  In general at least a few percent gain in most things I looked at, generally in the 5-10% range.


  1. How does Freedreno compare to the Qualcomm proptietary blob (features, performance)?

  2. features is relatively easy, you can use the glxinfo page to compare:

    the tl;dr: version is a3xx is basically equal (gles 3.0 support for both), a4xx is a bit more behind (blob has gles 3.1, mesa at gles 3.0). However with mesa, both a3xx and a4xx have support for desktop gl 3.1 and glx, which the blob driver does not. And mesa also has support for linux, where blob is only available for android.

    Performance is a harder question, for lack of any good way to do an "apples to apples" comparison. Ie. no blob for linux. Since freedreno is running on android, we could possibly do some comparison there. Although still a bit limited (for example gfxbench would be the obvious thing to compare, except the free version doesn't run on unofficial/aosp builds.. and even there we'd be comparing freedreno + upstream kernel vs blob with downstream kernel).

    Just looking from the perspective of performance related tricks that are not implemented in freedreno yet, I'd guess there is room for another 10-20% gain, maybe more in some edge cases.

    1. > blob is only available for android

      Sorry for amateur question, but aren't the blob is just a driver for Android kernel, whereas the kernel is Linux (with some minor patches, but I haven't heard of graphics related).

    2. Yagamy Light, no it is a lot more than that.

      On the userspace side, the closed source blob driver does not have support for x11 or wayland, or desktop OpenGL.. so it isn't really possible to compare performance on something like xonotic or supertuxkart.

      On the kernel side, it is more like 1-2million lines of patches for the downstream android kernel, although not all graphics related (but many would effect graphics performance indirectly, ie cpu and memory bus scaling). And a very different kernel driver for the gpu.

      We could perhaps compare some things on AOSP with upstream kernel + freedreno vs downstream kernel + blob. One would have to either port kgsl (android kernel gfx driver) to upstream kernel or backport drm/msm do downstream kernel to decouple the effect of the rest of the kernel vs gfx driver stack. And I think glmark2 is about all we could compare (gfxbench refuses to run on "unofficial" android builds)