Identify and implement SSE/SSE2 optimized sections #103
Comments
Added SSE optimized clippers. A lot of changes were needed to make sure that the buffers are 16-byte aligned (basically buffers coming from backends & scratch buffers have to be). Didn't update all backends yet, so some builds may be broken. |
Worked the aligned buffers a bit more, mix() interface is back to using unaligned buffers. |
Could always impose the restriction that buffers passed to mix() must be 16-byte aligned? Optionally have it just fall back to scalar processing if unaligned memory is passed in? Would be nice to catch or warn on this behaviour of course. |
The unaligned buffers thing isn't an issue anymore, I just shuffled the way buffer-to-buffer processing was done in mix(), and got rid of a buffer copy for 16bit samples at the same time. |
Also, requiring aligned buffers over foreign interfaces would not have worked. |
One thing I did consider (and may end up doing in other SSE optimizations should I do more) is to handle the start and end as scalars and the aligned middle as SSE. |
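The scalar-edges idea above can be sketched roughly as follows. This is only an illustration, not SoLoud's clipper: the function names `clipScalar` and `clipInPlace` and the `SKETCH_HAVE_SSE` macro are made up for this example, and it does a plain hard clip rather than the roundoff clip. The unaligned head and the leftover tail go through scalar code, while the 16-byte aligned middle uses aligned SSE loads and stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// SKETCH_HAVE_SSE is a hypothetical macro used only in this sketch.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Hard-clip one sample to [lo, hi].
static inline float clipScalar(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

// Clip `count` floats in place. The unaligned head and the tail are
// handled as scalars; the aligned middle is handled 4 samples at a time.
void clipInPlace(float *buf, std::size_t count, float lo, float hi)
{
    assert(lo <= hi);
    std::size_t i = 0;
#if SKETCH_HAVE_SSE
    // Scalar head: advance until the address is 16-byte aligned.
    while (i < count && (reinterpret_cast<std::uintptr_t>(buf + i) & 15u) != 0)
    {
        buf[i] = clipScalar(buf[i], lo, hi);
        ++i;
    }
    // Aligned middle: aligned loads/stores are safe from here on.
    const __m128 vlo = _mm_set1_ps(lo);
    const __m128 vhi = _mm_set1_ps(hi);
    for (; i + 4 <= count; i += 4)
    {
        __m128 v = _mm_load_ps(buf + i);
        v = _mm_min_ps(_mm_max_ps(v, vlo), vhi);
        _mm_store_ps(buf + i, v);
    }
#endif
    // Scalar tail (or the whole buffer when SSE is not available).
    for (; i < count; ++i)
        buf[i] = clipScalar(buf[i], lo, hi);
}
```

With this shape the caller does not have to guarantee alignment or a sample count divisible by 4, at the cost of a few scalar iterations at each end.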
Thanks to seeing email notifications SoLoud has been brought to my attention again.... One thing I have been thinking about for when I get round to adding sound back into my home engine is looking at writing some ISPC kernels for performing mixing, clipping, filters, etc. Curious how you'd feel about that? I wouldn't say it needs to be embedded into SoLoud itself, but the ability to provide your own functions to perform that processing could be nice (perhaps just on a #define rather than adding to the interfaces) |
There is another issue with the SSE clipping if the number of samples requested is not a multiple of 4. I fixed this by making sure that the scratch buffer size is always rounded up and by changing clip to process a rounded-up number of samples. This will clip up to 3 additional samples at the end that contain undefined data, but that shouldn't be an issue since they won't be read. Changed code:
postinit: mScratchSize = (aBufferSize+3)&~0x3;
clip: for (i = 0; i < (aSamples+3) / 4; i++) |
Wait, what? A sample buffer that's not divisible by 4? What back end? |
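The two rounding expressions in that fix can be checked in isolation. The helper names `roundUpTo4` and `sseIterations` below are hypothetical and exist only for this sketch; the arithmetic is the same as in the changed code quoted above.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of 4; multiples of 4 are unchanged.
// Same arithmetic as (aBufferSize+3)&~0x3.
constexpr std::size_t roundUpTo4(std::size_t n)
{
    return (n + 3) & ~static_cast<std::size_t>(3);
}

// Number of 4-sample SSE iterations needed to cover `samples` samples.
// Same arithmetic as (aSamples+3) / 4.
constexpr std::size_t sseIterations(std::size_t samples)
{
    return (samples + 3) / 4;
}

// A 441-sample request needs a 444-sample scratch buffer and 111 iterations,
// so the last iteration touches 3 samples past the requested range.
static_assert(roundUpTo4(441) == 444, "rounded up to a multiple of 4");
static_assert(sseIterations(441) == 111, "one extra partial iteration");
static_assert(sseIterations(441) * 4 <= roundUpTo4(441), "loop stays inside the scratch buffer");
```

The last assertion is the property the fix relies on: the rounded-up loop never runs past the rounded-up scratch buffer.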
I'm using a custom Wasapi backend which uses a buffer size that is a multiple of the minimum device period (as reported by audioClient->GetDevicePeriod). |
Curious. I wouldn't have expected that. Thanks for catching this.
|
#235 adds support for float to short interlacing for mono and stereo streams (it processes 16 floats at a time in either case). I ended up using 2 SSE2 instructions, which made it easier to implement. Since SSE2 was first introduced in 2001, I don't think it will be much of a problem. Maybe we could add another macro to control SSE2 blocks instead of a single switch:
SOLOUD_ASM_INTRINSICS // Controls all assembly intrinsics at once
SOLOUD_ASM_INTRINSICS_SSE // Controls only SSE intrinsics
SOLOUD_ASM_INTRINSICS_SSE2 // Controls only SSE/SSE2 intrinsics
// ...
SOLOUD_ASM_INTRINSICS_NEON // Controls only Neon intrinsics
// etc. |
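As a rough sketch of how such a per-ISA switch might gate a float to 16-bit conversion: the macro name `SOLOUD_ASM_INTRINSICS_SSE2` is taken from the proposal above and is not confirmed to exist in SoLoud, the function name `floatToS16` is hypothetical, and the choice of `_mm_cvtps_epi32` plus `_mm_packs_epi32` is an assumption rather than what #235 necessarily uses. The sketch converts without interleaving and falls back to scalar code when the switch is off.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Assumption for this sketch: turn the proposed switch on automatically
// when the compiler targets SSE2, so the vector path gets exercised.
#if !defined(SOLOUD_ASM_INTRINSICS_SSE2) && \
    (defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2))
#define SOLOUD_ASM_INTRINSICS_SSE2
#endif

#ifdef SOLOUD_ASM_INTRINSICS_SSE2
#include <emmintrin.h>
#endif

// Convert floats (nominally in [-1, 1]) to signed 16-bit samples.
void floatToS16(const float *src, std::int16_t *dst, std::size_t count)
{
    std::size_t i = 0;
#ifdef SOLOUD_ASM_INTRINSICS_SSE2
    const __m128 scale = _mm_set1_ps(32767.0f);
    for (; i + 8 <= count; i += 8)
    {
        // Scale and round 2 x 4 floats to 32-bit integers.
        __m128i a = _mm_cvtps_epi32(_mm_mul_ps(_mm_loadu_ps(src + i), scale));
        __m128i b = _mm_cvtps_epi32(_mm_mul_ps(_mm_loadu_ps(src + i + 4), scale));
        // Pack to 8 x int16 with signed saturation.
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i), _mm_packs_epi32(a, b));
    }
#endif
    // Scalar path: remaining samples, or everything when the switch is off.
    for (; i < count; ++i)
    {
        float v = src[i] * 32767.0f;
        v = v > 32767.0f ? 32767.0f : (v < -32768.0f ? -32768.0f : v);
        dst[i] = static_cast<std::int16_t>(std::lrintf(v));
    }
}
```

Both paths round to nearest under the default rounding mode and saturate moderately out-of-range input, so results should match regardless of which macro configuration is built.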
According to the Steam Hardware Survey, SSE3 is now safe to use. SSE2 has been the default ISA in Visual Studio for a few years now, so SSE2 is definitely not a problem. panAndExpand shows up as a major player in performance metrics, so it looks like a good candidate for SIMD. |
The recent release sets FPU flags on the audio thread to treat denormals as zero, which gave a clear performance boost in a simple synthetic test. |
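On x86 with SSE, one way to get this behaviour is to set the flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits in the MXCSR register of the calling thread. The sketch below shows that approach only; the names `enableDenormalsAsZero`, `kHaveSse` and `SKETCH_HAVE_SSE` are hypothetical, and whether SoLoud sets the flags exactly this way is an assumption.

```cpp
#include <cassert>

// SKETCH_HAVE_SSE and kHaveSse are hypothetical names for this sketch.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
constexpr bool kHaveSse = true;
#else
#define SKETCH_HAVE_SSE 0
constexpr bool kHaveSse = false;
#endif

// Ask the SSE unit of the calling thread to treat denormals as zero.
// Returns true when both flags are set, false when built without SSE.
// MXCSR is per thread, so this has to run on the audio thread itself.
bool enableDenormalsAsZero()
{
#if SKETCH_HAVE_SSE
    const unsigned int kFlushToZero = 0x8000u;      // MXCSR bit 15 (FTZ)
    const unsigned int kDenormalsAreZero = 0x0040u; // MXCSR bit 6 (DAZ)
    const unsigned int kBoth = kFlushToZero | kDenormalsAreZero;
    // Note: DAZ is assumed to be supported, which holds for SSE2-era CPUs.
    _mm_setcsr(_mm_getcsr() | kBoth);
    return (_mm_getcsr() & kBoth) == kBoth;
#else
    return false;
#endif
}
```

Calling this once at the start of the audio callback thread is enough; other threads keep their own floating-point settings.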
Added SIMD optimizations for panAndExpand cases 1->2 and 2->2, which are (probably) the most common ones. The speedup was massive: a test that took 0.150 seconds now takes 0.030. |
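To give an idea of what the 1->2 case involves, here is a simplified sketch, not the actual panAndExpand code. It assumes separate (non-interleaved) left and right output buffers and constant per-channel gains, and it leaves out any per-sample volume ramping a real mixer may need. The names `panMonoToStereo` and `SKETCH_HAVE_SSE` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// SKETCH_HAVE_SSE is a hypothetical macro used only in this sketch.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Mix a mono source into separate left/right buffers:
// left[i] += mono[i] * gainL, right[i] += mono[i] * gainR.
void panMonoToStereo(const float *mono, float *left, float *right,
                     std::size_t count, float gainL, float gainR)
{
    assert(mono != nullptr && left != nullptr && right != nullptr);
    std::size_t i = 0;
#if SKETCH_HAVE_SSE
    const __m128 gl = _mm_set1_ps(gainL);
    const __m128 gr = _mm_set1_ps(gainR);
    for (; i + 4 <= count; i += 4)
    {
        // One load of the source feeds both output channels.
        const __m128 s = _mm_loadu_ps(mono + i);
        _mm_storeu_ps(left + i, _mm_add_ps(_mm_loadu_ps(left + i), _mm_mul_ps(s, gl)));
        _mm_storeu_ps(right + i, _mm_add_ps(_mm_loadu_ps(right + i), _mm_mul_ps(s, gr)));
    }
#endif
    // Scalar tail (or the whole buffer when SSE is not available).
    for (; i < count; ++i)
    {
        left[i] += mono[i] * gainL;
        right[i] += mono[i] * gainR;
    }
}
```

Unaligned loads and stores are used here so the sketch makes no assumption about buffer alignment; with aligned scratch buffers the aligned variants could be used instead.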
There are some pieces that could benefit from SSE optimizations. For example, the roundoff clipper was sped up 3x in an experiment.
Other potential places may be the float->16bit converter, cubic resampler, 3d audio calculations, FFT maybe?
Based on profiling, the vast majority of time goes to audio sources, though. Some filters may also be potential targets.