Draw calls performance bottleneck #1026

bvssvni · 2016-01-11T15:22:02Z

This issue is meant to help people understand what causes a performance bottleneck in piston-graphics, and what the plan is to fix it.

For each draw call, the CPU needs to send data to the GPU. The GPU is often very fast at rendering. If this capacity is not fully used, the GPU sits and waits for more input from the CPU.

GPUs are designed for handling massive amounts of data with a limited set of variation. What the GPU does is controlled through a shader language. For OpenGL the shader language is GLSL.

When you render a rectangle, this is what happens:

1. Transformed triangles are created on the CPU in chunks and sent to the graphics backend.
2. The backend writes the received data to dynamic buffers.
3. The graphics driver tells the GPU to render using the updated buffers.
4. The GPU renders using a precompiled shader and paints pixels in the frame buffer.
5. The frame buffer is swapped with the current one to update the display.

Steps 1-4 happen repeatedly when drawing many objects for each frame.

In the Gfx backend the draw commands are collected upfront and given to the driver at the same time. However, from the graphics driver side, the instructions seem similar to the ones generated by the OpenGL backend (except for changes made to the draw state).

The 1st step happens on the CPU by design in piston-graphics. Reasons to triangulate on the CPU:

take advantage of f64 precision of matrix transformations, which are not that frequently supported on the GPU
non-linear transforms, such as least square deforming
easy to implement a new backend, with potential for one using software rasterization
great flexibility for the generic libraries depending on piston-graphics
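To make the first reason concrete, here is a minimal sketch of CPU-side triangulation: a rectangle is transformed with an f64 matrix and only narrowed to f32 when the vertices are emitted for the backend. The `Matrix2d` layout and the function names are assumptions of this sketch, not the library's confirmed API.

```rust
/// A 2x3 affine matrix in f64, row-major: [[a, b, tx], [c, d, ty]].
/// The exact type is an assumption made for this sketch.
pub type Matrix2d = [[f64; 3]; 2];

/// Transform a point entirely in f64.
fn transform_point(m: &Matrix2d, x: f64, y: f64) -> [f64; 2] {
    [
        m[0][0] * x + m[0][1] * y + m[0][2],
        m[1][0] * x + m[1][1] * y + m[1][2],
    ]
}

/// Triangulate a rectangle `[x, y, w, h]` into two triangles (six vertices).
/// All math is done in f64; the result is narrowed to f32 at the very end,
/// which is what a typical GPU vertex buffer expects.
pub fn triangulate_rect(m: &Matrix2d, rect: [f64; 4]) -> [[f32; 2]; 6] {
    let [x, y, w, h] = rect;
    let corners = [(x, y), (x + w, y), (x, y + h), (x + w, y + h)];
    let mut p = [[0.0f32; 2]; 4];
    for (i, &(cx, cy)) in corners.iter().enumerate() {
        let t = transform_point(m, cx, cy);
        p[i] = [t[0] as f32, t[1] as f32];
    }
    // Two triangles sharing the diagonal between corners 1 and 2.
    [p[0], p[1], p[2], p[1], p[3], p[2]]
}
```

Because the output is plain triangle data, a backend only needs to know how to draw triangles, which is what keeps new backends (including a software rasterizer) easy to write.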

Some questions one might ask:

Is the 1st step or 2nd step the primary cause of the bottleneck?
Is it possible to reduce the overhead without making design tradeoffs?

Before making changes to the design, one might consider using the strengths it offers to fix the problem. It seems the largest overhead comes from the number of draw calls, so we should look for ways to reduce that number first. This happens in the 2nd step, not the 1st!

Batch, batch, batch!

The key insight here is that since piston-graphics triangulates on the CPU, we could pack multiple shapes into the same buffer in the backend. This leads to fewer draw calls when:

The draw state is the same between calls
The same shader program is used (colored vs textured)
The same texture is used
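The batching conditions above could be sketched roughly as follows. `BatchKey` and `Batcher` are hypothetical names for this illustration, and the draw state is reduced to a plain integer; a real backend would upload the vertex buffer and issue the draw call where the counter is incremented.

```rust
/// Hypothetical key: everything that must match for two shapes to share
/// one draw call.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct BatchKey {
    /// Stand-in for the real draw state (blending, scissor, stencil, ...).
    pub draw_state: u32,
    /// `None` selects the colored shader, `Some(id)` the textured shader
    /// with that texture bound.
    pub texture: Option<u32>,
}

/// Accumulates triangles and only "draws" when the key changes.
pub struct Batcher {
    key: Option<BatchKey>,
    vertices: Vec<[f32; 2]>,
    /// Number of draw calls that would have been issued so far.
    pub draw_calls: usize,
}

impl Batcher {
    pub fn new() -> Self {
        Batcher { key: None, vertices: Vec::new(), draw_calls: 0 }
    }

    /// Append triangles; flush first if they cannot share the current batch.
    pub fn push(&mut self, key: BatchKey, tris: &[[f32; 2]]) {
        if self.key != Some(key) {
            self.flush();
            self.key = Some(key);
        }
        self.vertices.extend_from_slice(tris);
    }

    /// Submit the pending vertices as a single draw call, if there are any.
    pub fn flush(&mut self) {
        if !self.vertices.is_empty() {
            // A real backend would write `self.vertices` to a dynamic
            // buffer and issue one draw call here.
            self.draw_calls += 1;
            self.vertices.clear();
        }
    }
}
```

With this sketch, three colored shapes, then two shapes using the same texture, then one more colored shape end up as three draw calls instead of six, provided `flush` is called once at the end of the frame.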

One downside is that many backend instances lead to higher memory usage. Based on experience so far, most applications only use one instance, so I do not think this is a problem.

For example, in Conrod a lot of solid colored shapes are rendered, then some textured shapes (text), then more solid colored shapes, and so on. Currently the CharacterCache backends rasterize each character with FreeType into a separate texture. This means we can reduce the number of draw calls for solid shapes, but not for text.

In the case of text, we could try two different approaches:

1. Pack glyphs in a single image and update a texture
2. Since glyphs are often of similar size, consider using texture arrays
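As an illustration of the first approach, glyphs could be placed into one shared image with a simple shelf (row) packer. This is only one possible packing strategy, picked here for brevity; `ShelfPacker` and its methods are hypothetical names.

```rust
/// Hypothetical shelf packer: places glyph rectangles left to right,
/// starting a new row when the current one is full.
pub struct ShelfPacker {
    width: u32,
    height: u32,
    cursor_x: u32,
    cursor_y: u32,
    row_height: u32,
}

impl ShelfPacker {
    pub fn new(width: u32, height: u32) -> Self {
        ShelfPacker { width, height, cursor_x: 0, cursor_y: 0, row_height: 0 }
    }

    /// Reserve a `w` x `h` slot and return its top-left offset in the atlas,
    /// or `None` if the glyph does not fit.
    pub fn alloc(&mut self, w: u32, h: u32) -> Option<[u32; 2]> {
        if w > self.width {
            return None;
        }
        if self.cursor_x + w > self.width {
            // Current row is full: move down by the tallest glyph in it.
            self.cursor_y += self.row_height;
            self.cursor_x = 0;
            self.row_height = 0;
        }
        if self.cursor_y + h > self.height {
            return None;
        }
        let pos = [self.cursor_x, self.cursor_y];
        self.cursor_x += w;
        self.row_height = self.row_height.max(h);
        Some(pos)
    }
}
```

The glyph cache would rasterize each new glyph, copy its pixels to the returned offset in the shared image, and update the single texture, so that all text drawn from that cache can share one texture binding.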

Number one seems sensible to test first because it would benefit from the same reduction of draw calls. However, it requires some changes:

Character should take &'a T to the texture, separating offset and size from texture storage internally in the glyph cache
Change CharacterCache::character to return Character<'a, T>
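A rough sketch of what the two changes might look like is given below. The field names, the `u32` font size, `DummyTexture`, and `AtlasCache` are assumptions made for illustration; only the idea of `Character<'a, T>` borrowing the texture and `CharacterCache::character` returning it by value comes from the plan above.

```rust
use std::collections::HashMap;

/// Glyph description that borrows a shared texture instead of owning one.
pub struct Character<'a, T> {
    /// Top-left corner of the glyph inside the shared texture (assumed field).
    pub atlas_offset: [u32; 2],
    /// Size of the glyph inside the shared texture (assumed field).
    pub atlas_size: [u32; 2],
    /// The shared texture all glyphs of this cache live in.
    pub texture: &'a T,
}

pub trait CharacterCache {
    type Texture;
    /// Returns the glyph by value, borrowing the texture from the cache.
    fn character<'a>(&'a mut self, font_size: u32, ch: char) -> Character<'a, Self::Texture>;
}

/// Stand-in for a backend texture type.
pub struct DummyTexture {
    pub id: u32,
}

/// Toy cache: one texture, glyph slots laid out in a single row.
pub struct AtlasCache {
    texture: DummyTexture,
    glyphs: HashMap<(u32, char), ([u32; 2], [u32; 2])>,
    next_x: u32,
}

impl AtlasCache {
    pub fn new() -> Self {
        AtlasCache { texture: DummyTexture { id: 7 }, glyphs: HashMap::new(), next_x: 0 }
    }
}

impl CharacterCache for AtlasCache {
    type Texture = DummyTexture;

    fn character<'a>(&'a mut self, font_size: u32, ch: char) -> Character<'a, DummyTexture> {
        let next_x = &mut self.next_x;
        // Look up the slot, or reserve a new square slot on first use.
        // A real cache would rasterize the glyph and upload its pixels here.
        let (offset, size) = *self.glyphs.entry((font_size, ch)).or_insert_with(|| {
            let slot = ([*next_x, 0], [font_size, font_size]);
            *next_x += font_size;
            slot
        });
        Character { atlas_offset: offset, atlas_size: size, texture: &self.texture }
    }
}
```

Since every `Character` points at the same texture, consecutive glyphs satisfy the "same texture" condition for batching.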

Alternative: Retained API

By organizing graphic primitives into a tree structure, one can traverse it and optimize the draw calls.

While this would be very interesting to work on, there are some major obstacles/unknowns:

One might want to use one tree structure for both 2D and 3D
Judging from previous projects, such as Ogre3D, writing a tree structure for 3D is complicated and might take years to mature, and would benefit from new language features in Rust to be extensible
It is not obvious why a retained API should be much faster than an immediate design writing directly into large buffers, because it depends on the method of sorting/optimization and input data
It seems to be a better idea to design a retained API based on project experience, rather than to fix a single optimization problem with 2D

Summary of plan

Change design of Character and CharacterCache
Write to larger buffers in the backends

I believe this plan requires the least effort and the fewest breaking changes. We keep the same overall design of piston-graphics and the existing benefits.


mitchmindtree · 2016-01-11T15:38:04Z

Sounds great 👍

bvssvni · 2016-02-01T21:18:01Z

This also requires changing the shaders from using a uniform color to one color per vertex. Triangles from different shapes get packed into the same buffer, so their colors must be stored separately.
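A minimal sketch of the per-vertex color layout, assuming a hypothetical `ColoredVertex` type and `append_shape` helper: each vertex carries its own color, so triangles from differently colored shapes can sit next to each other in one buffer.

```rust
/// Hypothetical vertex layout: position plus an RGBA color per vertex,
/// replacing a single uniform color per draw call.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ColoredVertex {
    pub pos: [f32; 2],
    pub color: [f32; 4],
}

/// Append one shape's triangles to a shared buffer, tagging every vertex
/// with that shape's color.
pub fn append_shape(
    buffer: &mut Vec<ColoredVertex>,
    positions: &[[f32; 2]],
    color: [f32; 4],
) {
    buffer.extend(positions.iter().map(|&pos| ColoredVertex { pos, color }));
}
```

The cost is extra data per vertex (four floats in this sketch), traded against fewer draw calls.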

crumblingstatue · 2016-04-03T06:40:21Z

I absolutely love https://love2d.org/wiki/SpriteBatch.
It would be nice to have a similar feature in Piston.
It's kind of off-putting when your Rust game runs slower than Lua because of the drawing overhead.

bvssvni · 2016-04-03T18:31:45Z

@crumblingstatue Can you open a new issue about it? Thanks!

crumblingstatue · 2016-04-03T18:42:43Z

> @crumblingstatue Can you open a new issue about it? Thanks!

Alright, I opened #1041.

ishitatsuyuki · 2017-03-23T09:21:49Z

I've found the text renderer horrible. The minimal overhead is about 23 calls/frame (rusttype's gpu_cache example).

However, Piston doesn't batch it at all and does many context switches, such as enabling and disabling scissors. This resulted in 1000 calls/frame (and due to the Text implementation, it can increase further with more characters).

This is roughly a 50x slowdown. Not really affordable.

bvssvni · 2017-03-23T23:01:56Z

@ishitatsuyuki Yeah, text rendering is really bad right now.

KongouDesu · 2018-04-17T22:53:55Z

What's the current state of this issue, especially in regard to text rendering?

bvssvni · 2018-04-17T23:02:59Z

Texture rendering is now significantly faster for the OpenGL backend, but the glyph cache implementation must be changed to take advantage of this optimization.

bvssvni added the information and draft labels on Jan 11, 2016

This was referenced Jan 19, 2016

Start a library for retained/declarative graphics (2D + 3D) PistonDevelopers/piston#1024

Closed

Redesign CharacterCache #1029

Closed

bvssvni mentioned this issue Mar 8, 2016

Change colored 2D shader to use color per vertex PistonDevelopers/shaders#36

Closed

indiv0 mentioned this issue May 13, 2016

Determine and Document the High-Level Design of rgframework indiv0/colonize#47

Closed
