Draw calls performance bottleneck #1026

Open
bvssvni opened this issue Jan 11, 2016 · 9 comments

@bvssvni
Member

bvssvni commented Jan 11, 2016

This issue is meant to help people understand what causes the performance bottleneck in piston-graphics, and what the plan is to fix it.

For each draw call, the CPU needs to send data to the GPU. The GPU is often very fast at rendering. If this capacity is not fully used, the GPU sits and waits for more input from the CPU.

GPUs are designed for handling massive amounts of data with a limited set of variation. What the GPU does is controlled through a shader language. For OpenGL the shader language is GLSL.

When you render a rectangle, this is what happens:

  1. Transformed triangles are created on the CPU in chunks and sent to the graphics backend.
  2. The backend writes the received data to dynamic buffers.
  3. The graphics driver tells the GPU to render using the updated buffers.
  4. The GPU renders using a precompiled shader and paints pixels in the frame buffer.
  5. The frame buffer is swapped with the current one to update the display.

Steps 1-4 happen repeatedly when drawing many objects in each frame.
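
To make the cost of this loop concrete, here is a minimal sketch that only counts work. `CountingBackend`, `rect_triangles` and `render_frame` are hypothetical names made up for illustration, not piston-graphics or backend API, and no real GPU calls are made.

```rust
/// Hypothetical stand-in for a graphics backend; it only counts work.
#[derive(Default)]
struct CountingBackend {
    buffer: Vec<[f32; 2]>, // stands in for a dynamic vertex buffer
    draw_calls: usize,
    buffer_swaps: usize,
}

impl CountingBackend {
    /// Steps 2-4: overwrite the buffer with one chunk and issue a draw call.
    fn draw_triangles(&mut self, vertices: &[[f32; 2]]) {
        self.buffer.clear();
        self.buffer.extend_from_slice(vertices);
        self.draw_calls += 1;
    }

    /// Step 5: present the finished frame.
    fn swap_buffers(&mut self) {
        self.buffer_swaps += 1;
    }
}

/// Step 1: build two triangles for a rectangle [x, y, w, h] on the CPU.
fn rect_triangles(r: [f32; 4]) -> [[f32; 2]; 6] {
    let (x, y, x2, y2) = (r[0], r[1], r[0] + r[2], r[1] + r[3]);
    [[x, y], [x2, y], [x, y2], [x2, y], [x2, y2], [x, y2]]
}

/// Render one frame with one draw call per rectangle (no batching).
fn render_frame(backend: &mut CountingBackend, rects: &[[f32; 4]]) {
    for &r in rects {
        let tris = rect_triangles(r);
        backend.draw_triangles(&tris);
    }
    backend.swap_buffers();
}
```

With this model, a frame with N rectangles costs N draw calls and one buffer swap, which is the overhead the rest of this issue is about.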

In the Gfx backend the draw commands are collected upfront and given to the driver at the same time. However, from the graphics driver's side, the instructions seem similar to the ones generated by the OpenGL backend (except for changes made to the draw state).

The 1st step happens on the CPU by design in piston-graphics. Reasons to triangulate on the CPU:

  • take advantage of f64 precision for matrix transformations, which is not widely supported on the GPU
  • non-linear transforms, such as least-squares deformation
  • easy to implement a new backend, with potential for one using software rasterization
  • great flexibility for the generic libraries depending on piston-graphics
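
As a rough illustration of the first point, the sketch below transforms a rectangle in f64 on the CPU and only narrows to f32 when producing vertices. The 2x3 row-major matrix layout and the function names are assumptions for this example; it does not call into piston-graphics.

```rust
/// 2x3 affine matrix in row-major order, kept in f64 (illustrative type).
type Matrix2d = [[f64; 3]; 2];

/// Apply the transform in f64, then narrow to f32 for the vertex buffer.
fn transform_point(m: &Matrix2d, p: [f64; 2]) -> [f32; 2] {
    let x = m[0][0] * p[0] + m[0][1] * p[1] + m[0][2];
    let y = m[1][0] * p[0] + m[1][1] * p[1] + m[1][2];
    [x as f32, y as f32]
}

/// Triangulate a rectangle [x, y, w, h] into two triangles (six vertices),
/// with every vertex already transformed on the CPU.
fn triangulate_rect(m: &Matrix2d, rect: [f64; 4]) -> [[f32; 2]; 6] {
    let [x, y, w, h] = rect;
    let (x2, y2) = (x + w, y + h);
    [
        transform_point(m, [x, y]),
        transform_point(m, [x2, y]),
        transform_point(m, [x, y2]),
        transform_point(m, [x2, y]),
        transform_point(m, [x2, y2]),
        transform_point(m, [x, y2]),
    ]
}
```

Because the output is just a list of transformed triangles, a backend only needs to know how to draw triangle lists, which is what keeps new backends simple.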

Some questions one might ask:

  • Is the 1st step or 2nd step the primary cause of the bottleneck?
  • Is it possible to reduce the overhead without making design tradeoffs?

Before making changes to the design, one might consider using the strengths it offers to fix the problem. It seems the largest overhead comes from the number of draw calls, so we should look for ways to reduce that number first. This happens in the 2nd step, not the 1st!

Batch, batch, batch!

The key insight here is that since piston-graphics triangulates on the CPU, we could pack multiple shapes into the same buffer in the backend. This leads to fewer draw calls when:

  • The draw state is the same between calls
  • The same shader program is used (colored vs textured)
  • The same texture is used
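
A minimal sketch of that rule, assuming draw states and textures can be compared by an opaque id. `BatchKey`, `Program` and `Batcher` are hypothetical names for illustration; a real backend would upload the buffer and draw where the counter is incremented.

```rust
/// Which shader program a batch uses (colored vs textured).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Program {
    Colored,
    Textured,
}

/// Everything that must match for two shapes to share a draw call.
/// `draw_state` and `texture` are opaque ids in this sketch.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct BatchKey {
    draw_state: u32,
    program: Program,
    texture: Option<u32>,
}

#[derive(Default)]
struct Batcher {
    current: Option<BatchKey>,
    vertices: Vec<[f32; 2]>,
    draw_calls: usize,
}

impl Batcher {
    /// Append triangles; flush first if the key differs from the pending batch.
    fn push(&mut self, key: BatchKey, tris: &[[f32; 2]]) {
        if self.current != Some(key) {
            self.flush();
            self.current = Some(key);
        }
        self.vertices.extend_from_slice(tris);
    }

    /// Issue one draw call for everything accumulated so far.
    fn flush(&mut self) {
        if !self.vertices.is_empty() {
            self.draw_calls += 1; // a real backend would upload and draw here
            self.vertices.clear();
        }
    }
}
```

Draw order is preserved because a batch is only extended by consecutive shapes with the same key. Pushing ten solid shapes, five textured ones and ten more solid ones, followed by a final `flush`, ends up as three draw calls instead of twenty-five.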

One downside is that many backend instances lead to higher memory usage. Based on experience so far, most applications only use one instance, so I do not think this is a problem.

For example, in Conrod a lot of solid colored shapes are rendered, then some textured shapes (text), then more solid colored shapes, and so on. Currently the CharacterCache backends rasterize each character's glyph with FreeType into a separate texture. This means we can reduce the number of draw calls for solid shapes, but not for text.

In the case of text, we could try two different approaches:

  1. Pack glyphs in a single image and update a texture
  2. Since glyphs are often of similar size, consider using texture arrays

Number one seems sensible to test first because it would benefit from the same reduction of draw calls. However, it requires some changes:

  • Character should take &'a T to the texture, separating offset and size from texture storage internally in the glyph cache
  • Change CharacterCache::character to return Character<'a, T>
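
One possible shape for this, as a sketch only: the field names, `GlyphAtlas` and the simple row packing are assumptions, and rasterization, texture updates and atlas height limits are left out. The point is that every `Character` borrows the same texture and carries its own offset and size.

```rust
use std::collections::HashMap;

/// A glyph as a view into a shared atlas texture (hypothetical layout).
struct Character<'a, T> {
    texture: &'a T,
    atlas_offset: [u32; 2],
    atlas_size: [u32; 2],
}

/// Where a glyph lives inside the atlas.
#[derive(Clone, Copy)]
struct GlyphRect {
    offset: [u32; 2],
    size: [u32; 2],
}

/// Glyph cache that owns one texture and many rectangles.
struct GlyphAtlas<T> {
    texture: T,
    width: u32,
    cursor: [u32; 2], // next free position on the current row
    row_height: u32,
    glyphs: HashMap<char, GlyphRect>,
}

impl<T> GlyphAtlas<T> {
    fn new(texture: T, width: u32) -> Self {
        GlyphAtlas { texture, width, cursor: [0, 0], row_height: 0, glyphs: HashMap::new() }
    }

    /// Reserve space for a glyph, starting a new row when the current one is full.
    fn insert(&mut self, ch: char, size: [u32; 2]) {
        if self.cursor[0] + size[0] > self.width {
            self.cursor = [0, self.cursor[1] + self.row_height];
            self.row_height = 0;
        }
        self.glyphs.insert(ch, GlyphRect { offset: self.cursor, size });
        self.cursor[0] += size[0];
        self.row_height = self.row_height.max(size[1]);
    }

    /// Look up a glyph; the result borrows the single atlas texture.
    fn character<'a>(&'a self, ch: char) -> Option<Character<'a, T>> {
        self.glyphs.get(&ch).map(|g| Character {
            texture: &self.texture,
            atlas_offset: g.offset,
            atlas_size: g.size,
        })
    }
}
```

Since all characters point at the same texture, a run of text satisfies the "same texture" condition for batching and can go into one draw call.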

Alternative: Retained API

By organizing graphic primitives into a tree structure, one can traverse it and optimize the draw calls.

While this would be very interesting to work on, there are some major obstacles/unknowns:

  • One might want to use one tree structure for both 2D and 3D
  • Judging from previous projects, such as Ogre3D, writing a tree structure for 3D is complicated and might take years to mature, and would benefit from new language features in Rust to be extensible
  • It is not obvious why a retained API should be much faster than an immediate design writing directly into large buffers, because it depends on the method of sorting/optimization and input data
  • It seems a better idea to design a retained API based on project experience, rather than to fix a single optimization problem with 2D

Summary of plan

  1. Change design of Character and CharacterCache
  2. Write to larger buffers in the backends

I believe this plan requires the least effort and the fewest breaking changes. We keep the same overall design of piston-graphics and its existing benefits.

@mitchmindtree
Contributor

Sounds great 👍

@bvssvni
Member Author

bvssvni commented Feb 1, 2016

This also requires changing the shaders from using a uniform color to one color per vertex. Triangles from different shapes get packed into the same buffer, so each vertex must carry its own color.
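
A sketch of what the per-vertex layout could look like on the CPU side. `ColoredVertex` and `push_colored` are hypothetical names, and the matching shader change is not shown.

```rust
/// Vertex that carries its own color, so shapes with different colors
/// can live in the same buffer.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct ColoredVertex {
    pos: [f32; 2],
    color: [f32; 4],
}

/// Append a triangle list, stamping every vertex with the shape's color.
fn push_colored(buffer: &mut Vec<ColoredVertex>, color: [f32; 4], positions: &[[f32; 2]]) {
    buffer.extend(positions.iter().map(|&pos| ColoredVertex { pos, color }));
}
```

The tradeoff is four extra floats per vertex in exchange for not having to change a uniform, and therefore split the batch, between shapes.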

@crumblingstatue

I absolutely love https://love2d.org/wiki/SpriteBatch.
It would be nice to have a similar feature in Piston.
It's kind of off-putting when your Rust game runs slower than Lua because of the drawing overhead.

@bvssvni
Member Author

bvssvni commented Apr 3, 2016

@crumblingstatue Can you open a new issue about it? Thanks!

@crumblingstatue

> @crumblingstatue Can you open a new issue about it? Thanks!

Alright, I opened #1041.

@ishitatsuyuki

I've found the text renderer horrible. The minimal overhead is about 23 calls/frame (rusttype's gpu_cache example).

However, Piston doesn't batch text at all and performs many state changes, such as enabling and disabling scissors. This resulted in 1000 calls/frame (and due to the Text implementation, it can increase further with more characters).

That is a slowdown of roughly 40-50x. Not really affordable.

@bvssvni
Member Author

bvssvni commented Mar 23, 2017

@ishitatsuyuki Yeah, text rendering is really bad right now.

@KongouDesu

What's the current state of this issue, especially in regard to text rendering?

@bvssvni
Member Author

bvssvni commented Apr 17, 2018

Texture rendering is now significantly faster for the OpenGL backend, but the glyph cache implementation must be changed to take advantage of this optimization.
