Identify and implement SSE/SSE2 optimized sections #103

jarikomppa · 2015-05-26T17:42:34Z

There are some pieces that could benefit from SSE optimizations. For example, the roundoff clipper was sped up 3x in an experiment.

Other potential places may be the float->16bit converter, cubic resampler, 3d audio calculations, FFT maybe?

Based on profiling, the vast majority of time goes to audio sources, though. Some filters may also be potential targets.

jarikomppa · 2015-05-27T20:07:46Z

Added SSE optimized clippers. A lot of changes were needed to make sure that the buffers are 16-byte aligned (basically buffers coming from backends & scratch buffers have to be).

Didn't update all backends yet, so some builds may be broken.

jarikomppa · 2015-05-28T06:26:31Z

Worked the aligned buffers a bit more, mix() interface is back to using unaligned buffers.

vk2gpu · 2015-05-28T06:33:10Z

Could always impose the restriction that buffers passed to mix() must be 16-byte aligned? Optionally have it just fall back to scalar processing if unaligned memory is passed in? Would be nice to catch or warn on this behaviour of course.
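A minimal sketch of the "fall back to scalar if unaligned" idea. The names `isAligned16`, `ClipPath` and `chooseClipPath` are hypothetical and not part of SoLoud; the sketch only shows the alignment test itself.

```cpp
#include <cstdint>

// True when the pointer sits on a 16-byte boundary, which is what
// aligned SSE loads and stores such as _mm_load_ps require.
inline bool isAligned16(const void *aPtr)
{
    return (reinterpret_cast<std::uintptr_t>(aPtr) & 15u) == 0;
}

// Hypothetical result type for the dispatcher below.
enum class ClipPath { Simd, Scalar };

// Hypothetical dispatcher: take the SIMD path only when both buffers
// are aligned, otherwise fall back to scalar processing.
inline ClipPath chooseClipPath(const float *aSrc, const float *aDst)
{
    return (isAligned16(aSrc) && isAligned16(aDst)) ? ClipPath::Simd
                                                    : ClipPath::Scalar;
}
```

A debug build could also log or assert when the scalar path is chosen, which would cover the "catch or warn" part.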

jarikomppa · 2015-05-28T06:36:50Z

The unaligned buffers thing isn't an issue anymore, I just shuffled the way buffer-to-buffer processing was done in mix(), and got rid of a buffer copy for 16bit samples at the same time.

jarikomppa · 2015-05-28T06:37:33Z

Also, requiring aligned buffers over foreign interfaces would not have worked.

jarikomppa · 2015-05-28T06:38:23Z

One thing I did consider (and may end up doing in other SSE optimizations should I do more) is to handle the start and end as scalars and the aligned middle as SSE.
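The scalar-head, SSE-middle, scalar-tail approach could look roughly like the sketch below. It uses a plain hard clip as a stand-in for SoLoud's actual clipper, and `clipBuffer` and `clipScalar` are hypothetical names. Without SSE support at compile time, the whole buffer goes through the scalar loops.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Hard clip of one sample to [aLo, aHi].
static inline float clipScalar(float aValue, float aLo, float aHi)
{
    return aValue < aLo ? aLo : (aValue > aHi ? aHi : aValue);
}

// Clips aCount samples in place; the buffer may start at any float address.
void clipBuffer(float *aBuf, std::size_t aCount, float aLo, float aHi)
{
    std::size_t i = 0;

    // Head: scalar until the next sample sits on a 16-byte boundary.
    while (i < aCount && (reinterpret_cast<std::uintptr_t>(aBuf + i) & 15u) != 0)
    {
        aBuf[i] = clipScalar(aBuf[i], aLo, aHi);
        ++i;
    }

#ifdef SKETCH_HAS_SSE
    // Middle: aligned loads and stores, four samples per iteration.
    const __m128 lo = _mm_set1_ps(aLo);
    const __m128 hi = _mm_set1_ps(aHi);
    for (; i + 4 <= aCount; i += 4)
    {
        __m128 v = _mm_load_ps(aBuf + i);
        v = _mm_min_ps(_mm_max_ps(v, lo), hi);
        _mm_store_ps(aBuf + i, v);
    }
#endif

    // Tail: whatever is left (at most 3 samples when the SSE path ran).
    for (; i < aCount; ++i)
    {
        aBuf[i] = clipScalar(aBuf[i], aLo, aHi);
    }
}
```

The head loop costs at most three scalar iterations for a float-aligned buffer, so callers are not required to align anything.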

vk2gpu · 2018-02-21T09:41:42Z

Thanks to seeing email notifications SoLoud has been brought to my attention again....

One thing I have been thinking about for when I get round to adding sound back into my home engine is looking at writing some ISPC kernels for performing mixing, clipping, filters, etc. Curious how you'd feel about that? I wouldn't say it needs to be embedded into SoLoud itself, but the ability to provide your own functions to perform that processing could be nice (perhaps just on a #define rather than adding to the interfaces)

stegei · 2018-05-15T13:42:59Z

There is another issue with the SSE clipping when the number of samples requested is not a multiple of 4.
I fixed this by making sure that the scratch buffer size is always rounded up and by changing clip to process a rounded up number of samples. This will clip up to 3 additional samples at the end that contain undefined data, but that shouldn't be an issue since they won't be read.

Changed code:
postinit: mScratchSize = (aBufferSize+3)&~0x3;
clip: for (i = 0; i < (aSamples+3) / 4; i++)
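The arithmetic behind the two changed lines, as a small standalone sketch. The helper names `roundUpTo4` and `simdIterations` are made up for illustration; only the expressions come from the comment above.

```cpp
#include <cstddef>

// Rounds a sample count up to the next multiple of 4, as in
// mScratchSize = (aBufferSize+3)&~0x3.
constexpr std::size_t roundUpTo4(std::size_t aCount)
{
    return (aCount + 3) & ~static_cast<std::size_t>(3);
}

// Number of 4-wide SIMD iterations needed to cover aSamples samples,
// as in the loop bound (aSamples+3) / 4.
constexpr std::size_t simdIterations(std::size_t aSamples)
{
    return (aSamples + 3) / 4;
}

// The loop touches simdIterations(n) * 4 samples, which never exceeds
// the rounded-up scratch size, so the extra samples stay inside the buffer.
static_assert(simdIterations(441) * 4 == roundUpTo4(441), "loop fits scratch");
```

For example, a request for 441 samples needs 111 iterations and a scratch buffer of at least 444 floats.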

jarikomppa · 2018-05-15T15:15:47Z

Wait, what? Sample buffer that's not divisible by 4? What back end?


stegei · 2018-05-16T09:30:34Z

I'm using a custom Wasapi backend which uses a buffer size that is a multiple of the minimum device period (as reported by audioClient->GetDevicePeriod).

jarikomppa · 2018-05-16T10:11:53Z

Curious. I wouldn't have expected that. Thanks for catching this.


osman-turan · 2019-10-30T15:02:52Z

#235 adds support for float to short interlacing for mono and stereo streams (it processes 16 floats at a time in either case). I ended up using 2 SSE2 instructions, which made it easier to implement. Since SSE2 was first introduced in 2000 with the Pentium 4, I don't think that will be much of a problem. Maybe we could add another macro to control SSE2 blocks instead of reusing SOLOUD_SSE_INTRINSICS, because that macro name is currently misleading. Another option might be to replace it with the following macros:

SOLOUD_ASM_INTRINSICS // Controls all assembly intrinsics at once
SOLOUD_ASM_INTRINSICS_SSE // Controls only SSE intrinsics
SOLOUD_ASM_INTRINSICS_SSE2 // Controls only SSE/SSE2 intrinsics
// ...
SOLOUD_ASM_INTRINSICS_NEON // Controls only Neon intrinsics
// etc.
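For context, a float to 16-bit conversion with SSE2 could be sketched as below. This is not the code from #235: it assumes the two SSE2 intrinsics are `_mm_cvtps_epi32` and `_mm_packs_epi32`, handles a single channel only, and works on 8 floats per iteration rather than 16. The function name `floatToShort` and the 32767 scale factor are also assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

// Converts samples in roughly [-1, 1] to signed 16-bit with saturation.
// Inputs are assumed finite and small enough to fit in a 32-bit integer
// after scaling.
void floatToShort(const float *aSrc, std::int16_t *aDst, std::size_t aCount)
{
    std::size_t i = 0;

#ifdef SKETCH_HAS_SSE2
    const __m128 scale = _mm_set1_ps(32767.0f);
    for (; i + 8 <= aCount; i += 8)
    {
        // Scale and round two groups of four floats to 32-bit integers.
        __m128i lo = _mm_cvtps_epi32(_mm_mul_ps(_mm_loadu_ps(aSrc + i), scale));
        __m128i hi = _mm_cvtps_epi32(_mm_mul_ps(_mm_loadu_ps(aSrc + i + 4), scale));
        // Pack eight 32-bit integers into eight 16-bit ones, saturating.
        _mm_storeu_si128(reinterpret_cast<__m128i *>(aDst + i),
                         _mm_packs_epi32(lo, hi));
    }
#endif

    // Scalar path for the remainder, or for everything without SSE2.
    for (; i < aCount; ++i)
    {
        long v = std::lrintf(aSrc[i] * 32767.0f);
        if (v > 32767) v = 32767;
        if (v < -32768) v = -32768;
        aDst[i] = static_cast<std::int16_t>(v);
    }
}
```

The saturating pack means out-of-range samples clamp instead of wrapping, so no separate clip pass is needed for the 16-bit output in this sketch.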

jarikomppa · 2020-02-07T13:30:12Z

According to the Steam Hardware Survey, SSE3 is now safe to use. SSE2 has been the default ISA in Visual Studio for a few years now, so SSE2 is definitely not a problem.

panAndExpand shows up as a major player in performance metrics, so it looks like a good candidate for SIMD.

jarikomppa · 2020-02-22T08:00:16Z

The recent release sets FPU flags for the audio thread so that denormals are treated as zero, which gave a clear performance boost in a simple synthetic test.
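On x86 with SSE, treating denormals as zero usually means setting the FTZ and DAZ bits in the MXCSR register. The sketch below shows that idea only; which flags SoLoud actually sets is not stated here, and the function names are hypothetical. It also assumes the CPU supports the DAZ bit.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// MXCSR bit 15 is flush-to-zero (FTZ): denormal results become zero.
// MXCSR bit 6 is denormals-are-zero (DAZ): denormal inputs are read as zero.
constexpr unsigned int kFtzBit = 0x8000u;
constexpr unsigned int kDazBit = 0x0040u;

// Call at the start of the audio thread; MXCSR is per-thread state, so
// other threads keep their own floating-point behavior.
// Returns false when built without SSE support.
bool enableDenormalsAsZero()
{
#ifdef SKETCH_HAS_SSE
    _mm_setcsr(_mm_getcsr() | kFtzBit | kDazBit);
    return true;
#else
    return false;
#endif
}

// Reports whether both bits are currently set on the calling thread.
bool denormalsAsZeroEnabled()
{
#ifdef SKETCH_HAS_SSE
    const unsigned int both = kFtzBit | kDazBit;
    return (_mm_getcsr() & both) == both;
#else
    return false;
#endif
}
```

Setting the flags per thread keeps the change local to audio processing, which matters when the host application relies on strict IEEE behavior elsewhere.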

jarikomppa · 2020-02-22T14:08:44Z

Added SIMD optimizations for panAndExpand cases 1->2 and 2->2, which are (probably) the most common ones. The speedup was massive: a test that took 0.150 seconds now takes 0.030 seconds, about 5x faster.

osman-turan added a commit to osman-turan/soloud that referenced this issue Oct 30, 2019

Add SSE/SSE2 accelerated float to short interlacing

6c99dd8

See: jarikomppa#103

osman-turan added a commit to osman-turan/soloud that referenced this issue Oct 30, 2019

Add SSE/SSE2 accelerated float to short interlacing

3b73450

See: jarikomppa#103

osman-turan mentioned this issue Oct 30, 2019

Add SSE/SSE2 accelerated float to short interlacing #235

Merged
