Identify and implement SSE/SSE2 optimized sections #103

Open
jarikomppa opened this issue May 26, 2015 · 15 comments

Comments

@jarikomppa
Owner

There are some pieces that could benefit from SSE optimizations. For example, the roundoff clipper was sped up 3x in an experiment.

Other candidates may be the float-to-16-bit converter, the cubic resampler, the 3D audio calculations, and perhaps the FFT.

Based on profiling, the vast majority of time goes to audio sources, though. Some filters may also be potential targets.

@jarikomppa
Owner Author

Added SSE optimized clippers. A lot of changes were needed to make sure that the buffers are 16-byte aligned (basically buffers coming from backends & scratch buffers have to be).
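To illustrate why the alignment matters, here is a minimal sketch of an SSE clipper. It is not SoLoud's actual code: `hard_clip_sse` is a hypothetical name, and it does a plain hard clip to [-1, 1] rather than the roundoff clipper. It assumes the buffer is 16-byte aligned and the sample count is a multiple of 4, because `_mm_load_ps` and `_mm_store_ps` fault on misaligned addresses.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical helper (not SoLoud's clipper): hard-clips samples to [-1, 1].
// Assumes `buf` is 16-byte aligned and `count` is a multiple of 4.
void hard_clip_sse(float* buf, std::size_t count) {
    const __m128 lo = _mm_set1_ps(-1.0f);
    const __m128 hi = _mm_set1_ps(1.0f);
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_load_ps(buf + i);        // aligned load of 4 samples
        v = _mm_min_ps(_mm_max_ps(v, lo), hi);  // clamp all 4 at once
        _mm_store_ps(buf + i, v);               // aligned store
    }
}
```

The aligned load and store are what force backend and scratch buffers to be allocated on 16-byte boundaries.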

Didn't update all backends yet, so some builds may be broken.

@jarikomppa
Owner Author

Worked on the aligned buffers a bit more; the mix() interface is back to accepting unaligned buffers.

@vk2gpu
Contributor

vk2gpu commented May 28, 2015

Could always impose the restriction that buffers passed to mix() must be 16-byte aligned? Optionally have it just fall back to scalar processing if unaligned memory is passed in? Would be nice to catch or warn on this behaviour of course.
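A rough sketch of that fallback idea, under stated assumptions: `is_aligned16` and `scale_samples` are hypothetical names, and a simple gain multiply stands in for whatever processing mix() would do. The function reports which path it took, so a caller could log a warning when the scalar path is hit.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>
#include <cstdint>

// True if the pointer sits on a 16-byte boundary.
bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical example: multiplies every sample by `gain`.
// Returns true if the SSE path was used, false if it fell back to scalar.
bool scale_samples(float* buf, std::size_t count, float gain) {
    if (is_aligned16(buf) && count % 4 == 0) {
        const __m128 g = _mm_set1_ps(gain);
        for (std::size_t i = 0; i < count; i += 4) {
            _mm_store_ps(buf + i, _mm_mul_ps(_mm_load_ps(buf + i), g));
        }
        return true;
    }
    // Scalar fallback for unaligned buffers or odd sample counts.
    for (std::size_t i = 0; i < count; ++i) {
        buf[i] *= gain;
    }
    return false;
}
```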

@jarikomppa
Owner Author

The unaligned buffers thing isn't an issue anymore. I just shuffled the way buffer-to-buffer processing was done in mix(), and got rid of a buffer copy for 16-bit samples at the same time.

@jarikomppa
Owner Author

Also, requiring aligned buffers over foreign interfaces would not have worked.

@jarikomppa
Owner Author

One thing I did consider (and may end up doing in other SSE optimizations should I do more) is to handle the start and end as scalars and the aligned middle as SSE.
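That head/middle/tail split could look roughly like the sketch below. The names `clamp1` and `clip_any_alignment` are made up for illustration, and a hard clip to [-1, 1] is used as the example operation. It assumes the buffer is at least float-aligned, so at most three scalar samples are needed to reach a 16-byte boundary.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Scalar clamp to [-1, 1], used for the head and tail.
static float clamp1(float x) {
    return std::min(1.0f, std::max(-1.0f, x));
}

// Hypothetical sketch: clips `count` samples regardless of buffer alignment.
void clip_any_alignment(float* buf, std::size_t count) {
    std::size_t i = 0;
    // Head: scalar until buf + i reaches a 16-byte boundary.
    while (i < count &&
           (reinterpret_cast<std::uintptr_t>(buf + i) & 15u) != 0) {
        buf[i] = clamp1(buf[i]);
        ++i;
    }
    // Middle: aligned groups of 4 samples with SSE.
    const __m128 lo = _mm_set1_ps(-1.0f);
    const __m128 hi = _mm_set1_ps(1.0f);
    for (; i + 4 <= count; i += 4) {
        __m128 v = _mm_load_ps(buf + i);
        _mm_store_ps(buf + i, _mm_min_ps(_mm_max_ps(v, lo), hi));
    }
    // Tail: the 0 to 3 samples that are left.
    for (; i < count; ++i) {
        buf[i] = clamp1(buf[i]);
    }
}
```

Nothing outside the requested range is read or written, so this works on buffers handed in over foreign interfaces.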

@vk2gpu
Contributor

vk2gpu commented Feb 21, 2018

Thanks to some email notifications, SoLoud has been brought to my attention again...

One thing I have been thinking about, for when I get round to adding sound back into my home engine, is writing some ISPC kernels for mixing, clipping, filters, etc. Curious how you'd feel about that? I wouldn't say it needs to be embedded into SoLoud itself, but the ability to provide your own functions to perform that processing could be nice (perhaps just behind a #define rather than by adding to the interfaces).

@stegei

stegei commented May 15, 2018

There is another issue with the SSE clipping when the number of samples requested is not a multiple of 4.
I fixed this by making sure that the scratch buffer size is always rounded up to a multiple of 4, and by changing clip to process a rounded-up number of samples. This will clip up to 3 additional samples at the end that contain undefined data, but that shouldn't be an issue since they won't be read.

Changed code:
postinit: mScratchSize = (aBufferSize+3)&~0x3;
clip: for (i = 0; i < (aSamples+3) / 4; i++)
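The arithmetic in those two lines can be checked in isolation. In this sketch the helper names `round_up_to_4` and `simd_iterations` are hypothetical; only the expressions come from the change above.

```cpp
#include <cstddef>

// (n + 3) & ~3 rounds n up to the next multiple of 4 (unchanged if already one).
constexpr std::size_t round_up_to_4(std::size_t n) {
    return (n + 3) & ~static_cast<std::size_t>(3);
}

// Number of 4-sample SSE iterations needed to cover `samples` samples.
constexpr std::size_t simd_iterations(std::size_t samples) {
    return (samples + 3) / 4;
}

// The loop touches simd_iterations(n) * 4 samples, which is exactly the
// rounded-up size, so the padded scratch buffer is never overrun.
static_assert(simd_iterations(441) * 4 == round_up_to_4(441),
              "loop extent matches padded buffer size");
```

For example, a request for 441 samples runs 111 iterations and touches 444 samples, 3 of which are padding.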

@jarikomppa
Owner Author

jarikomppa commented May 15, 2018 via email

@stegei

stegei commented May 16, 2018

I'm using a custom WASAPI backend which uses a buffer size that is a multiple of the minimum device period (as reported by audioClient->GetDevicePeriod).

@jarikomppa
Owner Author

jarikomppa commented May 16, 2018 via email

@osman-turan
Contributor

#235 adds support for float to short interlacing for mono and stereo streams (it processes 16 floats at a time in either case). I ended up using 2 SSE2 instructions, which made it easier to implement. Since SSE2 was first introduced in 2000, I don't think that will be much of a problem. Maybe we could add another macro to control SSE2 blocks instead of reusing SOLOUD_SSE_INTRINSICS, because the current macro name is misleading. Another option might be to replace it with the following macros:

SOLOUD_ASM_INTRINSICS // Controls all assembly intrinsics at once
SOLOUD_ASM_INTRINSICS_SSE // Controls only SSE intrinsics
SOLOUD_ASM_INTRINSICS_SSE2 // Controls only SSE/SSE2 intrinsics
// ...
SOLOUD_ASM_INTRINSICS_NEON // Controls only Neon intrinsics
// etc.
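For context on why SSE2 in particular is needed for the float-to-short path, here is a sketch of one way such a conversion can be written. It is not the code from #235: `float_to_int16_sse2` is a hypothetical name, it handles 8 floats per iteration rather than 16, and it does no channel interleaving. The two SSE2-only intrinsics here are `_mm_cvtps_epi32` and `_mm_packs_epi32`; whether these are the two instructions meant above is an assumption.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in SSE)
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: converts floats in [-1, 1] to 16-bit samples.
// Assumes `count` is a multiple of 8; uses unaligned loads and stores.
void float_to_int16_sse2(const float* src, std::int16_t* dst,
                         std::size_t count) {
    const __m128 scale = _mm_set1_ps(32767.0f);
    for (std::size_t i = 0; i < count; i += 8) {
        __m128 a = _mm_mul_ps(_mm_loadu_ps(src + i), scale);
        __m128 b = _mm_mul_ps(_mm_loadu_ps(src + i + 4), scale);
        __m128i ia = _mm_cvtps_epi32(a);  // SSE2: float -> int32, rounded
        __m128i ib = _mm_cvtps_epi32(b);
        // SSE2: pack 2 x 4 int32 into 8 int16 with signed saturation,
        // so out-of-range input clamps instead of wrapping around.
        __m128i packed = _mm_packs_epi32(ia, ib);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), packed);
    }
}
```

A block like this is what a macro such as SOLOUD_ASM_INTRINSICS_SSE2 would need to guard, since it will not compile for an SSE-only target.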

@jarikomppa
Owner Author

According to the Steam hardware survey, SSE3 is now safe to use. SSE2 has been the default ISA in Visual Studio for a few years now, so SSE2 is definitely not a problem.

panAndExpand shows up as a major player in performance metrics, so it looks like a good candidate for SIMD.

@jarikomppa
Owner Author

A recent release sets the FPU flags on the audio thread to treat denormals as zero, which gave a clear performance boost in a simple synthetic test.
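On x86 with SSE, one way to get that behaviour is to set the flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits in the MXCSR register. The sketch below is an assumption about how this can be done, not SoLoud's actual code, and `enable_denormal_flushing` is a hypothetical name. MXCSR is per-thread state, so the call has to be made from the audio thread itself.

```cpp
#include <xmmintrin.h>  // _mm_getcsr, _mm_setcsr

// MXCSR bit masks: FTZ is bit 15, DAZ is bit 6.
constexpr unsigned int kFlushToZero     = 0x8000u;
constexpr unsigned int kDenormalsAreZero = 0x0040u;

// Hypothetical helper: enables FTZ and DAZ for the calling thread and
// returns the previous MXCSR value so the caller can restore it later.
// Assumes the CPU supports DAZ (true of essentially all SSE2-era CPUs).
unsigned int enable_denormal_flushing() {
    const unsigned int previous = _mm_getcsr();
    _mm_setcsr(previous | kFlushToZero | kDenormalsAreZero);
    return previous;
}
```

Passing the returned value back to `_mm_setcsr` restores the previous floating-point mode.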

@jarikomppa
Owner Author

Added SIMD optimizations for the panAndExpand cases 1->2 and 2->2, which are (probably) the most common ones. The speedup was massive: a test that took 0.150 seconds now takes 0.030, about 5x faster.
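As a rough idea of what the 1->2 case involves, here is a simplified sketch. It is not SoLoud's panAndExpand: `pan_mono_to_stereo` is a hypothetical function, and it assumes separate (planar) left and right mix buffers, constant per-channel volumes with no ramping, and a sample count that is a multiple of 4.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical sketch: mixes a mono source into planar stereo buffers.
// Adds mono[i] * volL to left[i] and mono[i] * volR to right[i].
// Assumes `count` is a multiple of 4; unaligned access is allowed.
void pan_mono_to_stereo(const float* mono, float* left, float* right,
                        std::size_t count, float volL, float volR) {
    const __m128 vl = _mm_set1_ps(volL);
    const __m128 vr = _mm_set1_ps(volR);
    for (std::size_t i = 0; i < count; i += 4) {
        const __m128 s = _mm_loadu_ps(mono + i);  // 4 mono samples
        // Accumulate the scaled source into each output channel.
        _mm_storeu_ps(left + i,
                      _mm_add_ps(_mm_loadu_ps(left + i), _mm_mul_ps(s, vl)));
        _mm_storeu_ps(right + i,
                      _mm_add_ps(_mm_loadu_ps(right + i), _mm_mul_ps(s, vr)));
    }
}
```

Each source sample is loaded once and reused for both channels, four samples per iteration.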
