Tile-Based Deferred Renderers (TBDR) Pt 1

Background

Mobile GPUs work differently from traditional discrete GPUs. For a fun overview, ARM has a manga guide to their mobile GPUs, The Arm Manga Guide to the Mali GPU. Mobile GPUs rasterize binned geometry in discrete sections of the framebuffer (called "tiles") independently, rather than processing primitives across the entire framebuffer. These tiles might be as small as 8x8 or 16x16 pixels. In pseudocode, the difference between mobile GPUs and desktop GPUs comes down to:

Desktop

for (var triangle in triangles):
  for (var fragment in triangle):
    rasterize(fragment)

Mobile

for (var tile in tiles):
  for (var triangle in triangles where triangle in tile):
    for (var fragment in triangle):
      rasterize(fragment)
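
The "where triangle in tile" step in the mobile loop implies a binning pass that runs before rasterization. The following is a minimal C++ sketch of one way such a pass could work, assuming a conservative bounding-box test. The `Point` and `Triangle` types and the `BinTriangles` function are hypothetical names for illustration; real hardware binning is more sophisticated and vendor-specific.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; not taken from any driver or API.
struct Point { float x, y; };
struct Triangle { Point a, b, c; };

// Returns one bin per tile (row-major). Each bin lists the indices of the
// triangles whose bounding box overlaps that tile. The test is conservative:
// a triangle may be listed for a tile it does not actually cover.
std::vector<std::vector<std::size_t>> BinTriangles(
    const std::vector<Triangle>& triangles, int fb_width, int fb_height,
    int tile_size) {
  assert(fb_width > 0 && fb_height > 0 && tile_size > 0);
  const int tiles_x = (fb_width + tile_size - 1) / tile_size;
  const int tiles_y = (fb_height + tile_size - 1) / tile_size;
  std::vector<std::vector<std::size_t>> bins(
      static_cast<std::size_t>(tiles_x) * tiles_y);

  for (std::size_t i = 0; i < triangles.size(); ++i) {
    const Triangle& t = triangles[i];
    const float min_x = std::min({t.a.x, t.b.x, t.c.x});
    const float max_x = std::max({t.a.x, t.b.x, t.c.x});
    const float min_y = std::min({t.a.y, t.b.y, t.c.y});
    const float max_y = std::max({t.a.y, t.b.y, t.c.y});

    // Skip triangles that lie entirely outside the framebuffer.
    if (max_x < 0 || max_y < 0 || min_x >= fb_width || min_y >= fb_height) {
      continue;
    }

    // Convert the bounding box to a clamped range of tile coordinates.
    const int tx0 = std::clamp(static_cast<int>(min_x) / tile_size, 0, tiles_x - 1);
    const int tx1 = std::clamp(static_cast<int>(max_x) / tile_size, 0, tiles_x - 1);
    const int ty0 = std::clamp(static_cast<int>(min_y) / tile_size, 0, tiles_y - 1);
    const int ty1 = std::clamp(static_cast<int>(max_y) / tile_size, 0, tiles_y - 1);

    for (int ty = ty0; ty <= ty1; ++ty) {
      for (int tx = tx0; tx <= tx1; ++tx) {
        bins[ty * tiles_x + tx].push_back(i);
      }
    }
  }
  return bins;
}
```

With the bins built, the outer `for (var tile in tiles)` loop only has to walk the triangle list of the tile it is currently processing.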

While the mobile approach seems more complicated, it allows for more efficient memory access patterns. In the desktop approach, a triangle that covers a large percentage of the framebuffer requires writes across large regions of the backing memory. By binning the geometry by tile, the GPU keeps writes localized. A second important difference: in the desktop case, the writes and reads go to video memory. Mobile GPUs add an on-chip cache (tile memory) that is loaded and flushed only once while a tile is processed. The locality of this cache further improves efficiency. Expanding on the pseudocode from above, we can add the required load and store between tile memory and video memory.

for (var tile in tiles):
  var tile_memory = load(video_memory, load_action)
  for (var triangle in triangles where triangle in tile):
    for (var fragment in triangle):
      rasterize(fragment, tile_memory)
  store(video_memory, tile_memory, store_action)
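
To make the effect of the load and store steps concrete, here is a small C++ simulation of the loop above that counts how many pixels cross between video memory and tile memory. It is a sketch under stated assumptions: the `LoadAction` and `StoreAction` enums are local stand-ins, `Traffic` and `RenderTiled` are hypothetical names, and each "layer" is a full-screen draw that overwrites every pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-ins for illustration; real APIs define their own enums.
enum class LoadAction { kDontCare, kLoad, kClear };
enum class StoreAction { kDontCare, kStore };

struct Traffic {
  std::size_t video_reads = 0;   // pixels read from video memory
  std::size_t video_writes = 0;  // pixels written to video memory
};

// Draws `layers` full-screen layers tile by tile and counts the pixels
// that move between video memory and the on-chip tile memory.
Traffic RenderTiled(std::vector<std::uint32_t>& video_memory, int width,
                    int height, int tile_size, int layers,
                    LoadAction load_action, StoreAction store_action) {
  assert(width > 0 && height > 0 && tile_size > 0);
  assert(video_memory.size() == static_cast<std::size_t>(width) * height);
  Traffic traffic;
  std::vector<std::uint32_t> tile_memory(
      static_cast<std::size_t>(tile_size) * tile_size);

  for (int ty = 0; ty < height; ty += tile_size) {
    for (int tx = 0; tx < width; tx += tile_size) {
      const int tile_w = std::min(tile_size, width - tx);
      const int tile_h = std::min(tile_size, height - ty);

      // Load action: only kLoad touches video memory.
      for (int y = 0; y < tile_h; ++y) {
        for (int x = 0; x < tile_w; ++x) {
          std::uint32_t value = 0;  // kClear (kDontCare is undefined; 0 here)
          if (load_action == LoadAction::kLoad) {
            value = video_memory[(ty + y) * width + (tx + x)];
            ++traffic.video_reads;
          }
          tile_memory[y * tile_size + x] = value;
        }
      }

      // Rasterize: every layer overwrites the tile, entirely on-chip.
      for (int layer = 1; layer <= layers; ++layer) {
        for (int y = 0; y < tile_h; ++y) {
          for (int x = 0; x < tile_w; ++x) {
            tile_memory[y * tile_size + x] = static_cast<std::uint32_t>(layer);
          }
        }
      }

      // Store action: one write per pixel, regardless of overdraw.
      if (store_action == StoreAction::kStore) {
        for (int y = 0; y < tile_h; ++y) {
          for (int x = 0; x < tile_w; ++x) {
            video_memory[(ty + y) * width + (tx + x)] =
                tile_memory[y * tile_size + x];
            ++traffic.video_writes;
          }
        }
      }
    }
  }
  return traffic;
}
```

In this model, five layers of overdraw on a 64x64 target with a clear load action still produce only 64 * 64 writes to video memory, and a "don't care" store action produces none at all.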

Much of the performance work for mobile focuses on exploiting tile memory. For example, Multisample Anti-Aliasing (MSAA) is an anti-aliasing technique where N coverage samples are stored per pixel. At the end of a render pass, the N coverage samples per pixel are "resolved" (averaged) into a single sample. On a traditional desktop GPU, 4x MSAA requires allocating textures that are 4x the size of a non-MSAA texture. On a mobile tile-based deferred renderer (TBDR), the 4x samples are only stored in tile memory and are resolved during the flush to video memory. As a result, 4x MSAA is "free" (from a memory perspective) on TBDR architectures.
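
As a rough illustration of the memory claim, the C++ sketch below computes the video memory footprint of a texture with and without multisampling, and resolves one color channel by averaging its samples. `TextureBytes` and `ResolveChannel` are hypothetical helper names, and the simple box-filter average with rounding is an assumption; actual hardware resolve behavior may differ.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Bytes needed to back a texture in video memory.
constexpr std::size_t TextureBytes(std::size_t width, std::size_t height,
                                   std::size_t bytes_per_pixel,
                                   std::size_t sample_count) {
  return width * height * bytes_per_pixel * sample_count;
}

// Box-filter resolve of one 8-bit channel: the rounded average of N samples.
template <std::size_t N>
std::uint8_t ResolveChannel(const std::array<std::uint8_t, N>& samples) {
  static_assert(N > 0, "need at least one sample");
  std::size_t sum = 0;
  for (std::uint8_t s : samples) {
    sum += s;
  }
  return static_cast<std::uint8_t>((sum + N / 2) / N);
}
```

For a 400x400 target with 4 bytes per pixel, `TextureBytes(400, 400, 4, 1)` is 640,000 bytes while `TextureBytes(400, 400, 4, 4)` is 2,560,000 bytes. When the samples live only in tile memory, the larger allocation is never made in video memory.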

Working with Tile Memory

Tile memory can be used as the backing memory for attachments that are used during a render pass but do not need to be stored. Depth/stencil attachments are frequently discarded at the end of a render pass and are great candidates for tile memory. As mentioned above, MSAA normally requires allocating a color attachment that is N times as large, but tile memory can be used to store it instead.

In both Metal and Vulkan, we can use tile memory for an attachment by allocating special kinds of textures. In Metal, this is called a "memoryless" texture and in Vulkan it is called a "lazily allocated" texture. The difference: Vulkan allows the drivers to spill to video memory if there is insufficient tile memory. Impeller refers to both of these as "transient" attachments. The following attachment configuration creates two textures, one resolve texture and one 4x MSAA transient attachment. This won't actually allocate any storage space for the MSAA texture. Instead, it reserves a portion of the tile memory, approximately tile_size * pixel_format_size * 4 bytes.

TextureDescriptor resolve_desc;
resolve_desc.storage_mode = StorageMode::kDevicePrivate;
resolve_desc.type = TextureType::kTexture2D;
resolve_desc.sample_count = SampleCount::kCount1;
resolve_desc.format = PixelFormat::kB8G8R8A8UNormInt;
resolve_desc.size = {400, 400};
auto resolve_texture = allocator.CreateTexture(resolve_desc);

TextureDescriptor msaa_desc;
msaa_desc.storage_mode = StorageMode::kDeviceTransient;
msaa_desc.type = TextureType::kTexture2DMultisample;
msaa_desc.sample_count = SampleCount::kCount4;
msaa_desc.format = PixelFormat::kB8G8R8A8UNormInt;
msaa_desc.size = {400, 400};
auto msaa_texture = allocator.CreateTexture(msaa_desc);

ColorAttachment color0;
color0.texture = msaa_texture;
color0.clear_color = Color::Black();
color0.load_action = LoadAction::kClear;
color0.store_action = StoreAction::kMultisampleResolve;
color0.resolve_texture = resolve_texture;
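
The tile_size * pixel_format_size * 4 estimate can be generalized to several transient attachments. The C++ sketch below sums the per-tile cost and compares it against a budget. It assumes hypothetical names (`TransientAttachment`, `TileBytes`, `FitsInTileMemory`), and the tile dimensions and budget are caller-supplied placeholders rather than values for any specific GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one attachment kept in tile memory.
struct TransientAttachment {
  std::size_t bytes_per_pixel;
  std::size_t sample_count;
};

// Approximate tile memory needed for one tile across all attachments:
// tile_width * tile_height * sum(bytes_per_pixel * sample_count).
std::size_t TileBytes(const std::vector<TransientAttachment>& attachments,
                      std::size_t tile_width, std::size_t tile_height) {
  assert(tile_width > 0 && tile_height > 0);
  std::size_t bytes_per_tile_pixel = 0;
  for (const TransientAttachment& a : attachments) {
    bytes_per_tile_pixel += a.bytes_per_pixel * a.sample_count;
  }
  return tile_width * tile_height * bytes_per_tile_pixel;
}

// True if the attachments fit in the given (assumed) tile memory budget.
bool FitsInTileMemory(const std::vector<TransientAttachment>& attachments,
                      std::size_t tile_width, std::size_t tile_height,
                      std::size_t budget_bytes) {
  return TileBytes(attachments, tile_width, tile_height) <= budget_bytes;
}
```

For example, a 4-byte color format at 4 samples on an assumed 32x32 tile needs 32 * 32 * 4 * 4 = 16,384 bytes. Adding a 4-byte, 4-sample depth/stencil attachment doubles that to 32,768 bytes.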

And of course, any additional attachments that we create which are memoryless would be backed by tile memory ... right up until the point at which we run out of tile memory. Apple helpfully provides a table of how much tile memory each generation of their TBDR GPUs has, whereas with Vulkan devices I'm not aware of any technique to figure out when the attachments will begin spilling to video memory. Some of the early generation Android Vulkan devices don't seem to have a heap for lazily allocated memory. This means either that they aren't TBDR (unlikely), or that they only have enough tile memory to store the single-sampled color attachment of the default framebuffer pixel format.
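
Whether a Vulkan device exposes lazily allocated memory at all can be checked from its memory type flags. The sketch below is deliberately header-free so it stands alone: it takes the `propertyFlags` values as plain integers, which in real code would be copied from the `memoryTypes` array filled in by `vkGetPhysicalDeviceMemoryProperties`. The function name `HasLazilyAllocatedMemoryType` is hypothetical, and the constant mirrors the numeric value of `VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT`.

```cpp
#include <cstdint>
#include <vector>

// Mirrors the numeric value of VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT so
// that this sketch compiles without the Vulkan headers.
constexpr std::uint32_t kLazilyAllocatedBit = 0x00000010;

// `property_flags` holds one entry per memory type, copied (by assumption)
// from VkPhysicalDeviceMemoryProperties::memoryTypes[i].propertyFlags.
bool HasLazilyAllocatedMemoryType(
    const std::vector<std::uint32_t>& property_flags) {
  for (std::uint32_t flags : property_flags) {
    if ((flags & kLazilyAllocatedBit) != 0) {
      return true;
    }
  }
  return false;
}
```

A `false` result only says that no memory type advertises the flag. It does not say how much tile memory is present or when spilling would begin.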

Tile memory is limited in how data can be accessed from it. Because only a small region of the framebuffer is resident in tile memory at any one time, a convolution filter that samples from adjacent pixels cannot be expressed as an operation over attachments: we cannot guarantee that the adjacent pixel is in the current tile. This isn't really an issue for normal usage like multisampling, depth/stencil, or lighting passes.
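
The adjacency problem is easy to see with a little arithmetic. This C++ sketch, using hypothetical helper names and assuming square tiles aligned to the framebuffer origin, checks whether a neighboring pixel falls in the same tile and counts how many pixels of a tile have their whole filter footprint inside it.

```cpp
#include <cassert>

// True if pixel (x + dx, y + dy) lies in the same tile as pixel (x, y),
// assuming square tiles aligned to the framebuffer origin.
bool NeighborInSameTile(int x, int y, int dx, int dy, int tile_size) {
  assert(tile_size > 0 && x >= 0 && y >= 0);
  const int nx = x + dx;
  const int ny = y + dy;
  if (nx < 0 || ny < 0) {
    return false;  // outside the framebuffer
  }
  return nx / tile_size == x / tile_size && ny / tile_size == y / tile_size;
}

// Counts the pixels of one tile whose entire (2r+1) x (2r+1) kernel
// footprint stays inside that tile.
int PixelsWithKernelInTile(int tile_size, int radius) {
  int count = 0;
  for (int y = 0; y < tile_size; ++y) {
    for (int x = 0; x < tile_size; ++x) {
      bool inside = true;
      for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
          inside = inside && NeighborInSameTile(x, y, dx, dy, tile_size);
        }
      }
      if (inside) {
        ++count;
      }
    }
  }
  return count;
}
```

For a 16x16 tile and a 3x3 kernel (radius 1), only the 14 * 14 = 196 interior pixels out of 256 have all of their neighbors in the same tile, so every pixel on the tile border would need data that is not resident.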

Of course, we can do a lot of fun things with tile memory too, but I'll save that for another post.
