Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance: detector body sorting #1329

Open
cha0s opened this issue Jan 11, 2025 · 8 comments
Open

Performance: detector body sorting #1329

cha0s opened this issue Jan 11, 2025 · 8 comments

Comments

@cha0s
Copy link

cha0s commented Jan 11, 2025

bodies.sort(Detector._compareBoundsX);

The above code is largely inefficient most of the time in practice. This is because quicksort is generally less efficient on mostly-sorted arrays.

Funny as it may sound, the humble insertion sort is a great candidate for sorting mostly-sorted arrays. This is the case a lot of the time in practice: most objects do not move so far every frame as to essentially randomize the bounds sorting.

Replacing the code above with

  for (let i = 1; i < bodies.length; i++) {
    let current = bodies[i];
    let j = i - 1;
    while ((j > -1) && (current.bounds.min.x < bodies[j].bounds.min.x)) {
      bodies[j + 1] = bodies[j];
      j--;
    }
    bodies[j + 1] = current;
  }

runs almost 25% faster.

You may not agree with this and I'm curious what your opinion is.

Regardless of whether you think this sorting strategy makes sense in a general case, I suggest that the original line of code above is moved to a new method on Detector: sortBodies. If this was done, it would be much easier to swap out a sort that makes sense for your simulation by simply overriding Detector.sortBodies. As it stands today, the entire Detector.collisions method most be overridden to change the sorting strategy for bodies.

Thanks for your work!

@ggorlen
Copy link

ggorlen commented Jan 19, 2025

The builtin v8 sort doesn't use quicksort any longer, and when it did, it used insertion sort on arrays of less than ~20 elements, which was optimal at the time. I'm pretty sure most implementations now use timsort.

runs almost 25% faster.

Please share your benchmarking code you used to come to this conclusion.

You may not agree with this and I'm curious what your opinion is.

It shouldn't be a question of opinion, just a matter of proper benchmarking. That said, the chances are that a custom sort implementation will outperform the native v8 is extremely low, I would imagine.

@cha0s
Copy link
Author

cha0s commented Jan 19, 2025

My usecase is a topdown game where a body moves through space and other objects e.g. walk around "relatively" slowly. The temporal coherence in this case is quite high. In other words, the y-order of AABBs is relatively stable frame-to-frame vs. simulating e.g. gas particles bouncing around inside of a volume.

I don't have an isolated benchmark available. A simulation with hundreds of relatively-stable y-ordered bounding boxes is clearly more efficiently sorted by y-order with e.g. insertion sort. I'm not talking about < 20 elements. I assure you that I measured it for my own simulation! It's trivial to outperform the built-in array sort.

I agree that opinion doesn't really factor in. The only relevant opinion is whether the library itself should be optimized for more high-coherence use cases. Personally I suspect that this is, on average, the majority of use cases. I proposed making it easier to simply swap out the sort algorithm without having to write the entire Detector.collisions method for exactly that reason: opinion is irrelevant!

P.S. I don't even sort the bodies in my current implementation. I have written a custom Detector that hashes the world spatially to determine the eligibility of body collision. This blows away the built-in implementation performance which not only does no hashing whatsoever (e.g. will compare objects hundreds or thousands of units away on the x axis), but also copies the entire bodies array (see:

detector.bodies = bodies.slice(0);
) because .sort operates in-place and care is taken to not mutate it directly (though I wonder whether that care is strictly necessary).

Correction: The broadphase is actually done on the X axis, not the Y axis. The points remain the same though!

@cha0s
Copy link
Author

cha0s commented Jan 19, 2025

I looked into the sorting algorithm and you're right as far as I can tell: v8 uses an algorithm called timsort which is actually quite fast on mostly-sorted data.

I think the actual reason it's possible to beat the built-in implementation easily isn't because of the sorting algorithm at all, it's more what I mentioned in my previous comment: every world update completely smashes the previously-sorted array by copying the bodies array.

If that copy never took place and the bodies were simply sorted in-place continuously, the performance would probably increase greatly.

@ggorlen
Copy link

ggorlen commented Jan 19, 2025

It's trivial to outperform the built-in array sort.

The burden of proof is on you here, since the builtin v8 array sort is highly optmized by teams of uber-experts in algorithms, and written in low-level C++ (Torque?) code to be as fast as possible.

As you point out, Timsort is optimized on mostly-sorted arrays, but even if it wasn't, heavily-optimized insertion sort has been used in quicksort implementations for the leaf recursive nodes since time immemorial. There's basically never been a time in recent history where quicksort was used in totally pure form in any JS engine (guessing, but seems like a safe bet).

Similarly, bodies.slice(0); should be highly optimized and extremely fast, even with many thousands of elements (and most MJS simulations don't have hundreds of thousands or millions of bodies). Sure, if you can avoid a copy or a sort on every frame, that's great, but without looking into it, I assume this isn't a superfluous copy.

I don't know much about the internals of MJS, so I'll defer to those that do, but for any optimization proposals, a runnable, reproducible benchmark is pretty much standard practice to provide to kick off the conversation.

@cha0s
Copy link
Author

cha0s commented Jan 19, 2025

You're not seeing the whole picture. In isolation, sort and copy are optimized, but every time you are copy that array you are blowing away the previous sort. This is why it is trivial to outperform the built-in implementation.

It's fine if you don't want to take my word for it. I can just hack the library to fix the performance bottlenecks (which is what I'm doing)! Open source is about sharing what we find instead of keeping it to ourselves, but I'm certainly not interested in forcing the issue.

FYI, copy doesn't necessarily happen on every frame. It happens when the bodies configuration of the world changes. This, of course, may indeed be every frame based on the usage of the engine. As an aside, it is almost unbelievable that the idea of copying the entire world bodies array every frame would be handwaved away as "should be fast". This seems to imply a very fundamental difference between the kind of performance targets the two of us pursue.

@cha0s
Copy link
Author

cha0s commented Jan 20, 2025

This topic is just too silly to ignore :)

I deleted my last comment since I had a bug in the insertion sort.

Here's a benchmark:

const I = 500;
const N = 1000;

function createRandomBodies(N) {
  const bodies = [];
  for (let i = 0; i < N; ++i) {
    bodies.push({
      bounds: {
        max: {x: Math.random() * 500, y: Math.random() * 500},
        min: {x: Math.random() * 500, y: Math.random() * 500},
      },
    });
  }
  return bodies;
}

function nudgeBodies(bodies) {
  const stride = Math.floor(Math.random() * 10) + 1;
  for (let i = 0; i < N; i += stride) {
    const xn = (Math.random() * 50) - 100;
    const yn = (Math.random() * 50) - 100;
    bodies[i].bounds.max.x += xn;
    bodies[i].bounds.min.x += xn;
    bodies[i].bounds.max.y += yn;
    bodies[i].bounds.min.y += yn;
  }
}

const _compareBoundsX = function(bodyA, bodyB) {
  return bodyA.bounds.min.x - bodyB.bounds.min.x;
};


function insertionSort(array) {
  for (let i = 1; i < array.length; i++) {
    let current = array[i];
    let j = i - 1;
    while ((j > -1) && (current.bounds.min.x < array[j].bounds.min.x)) {
      array[j + 1] = array[j];
      j--;
    }
    array[j + 1] = current;
  }
}

const bodies = createRandomBodies(N);

let start;

start = performance.now();
for (let i = 0; i < I; ++i) {
  const builtinSortBodies = bodies.slice(0);
  builtinSortBodies.sort(_compareBoundsX);
}
console.log('builtin full-copy sort took', performance.now() - start, 'ms');

start = performance.now();
for (let i = 0; i < I; ++i) {
  const insertionSortBodies = bodies.slice(0);
  insertionSort(insertionSortBodies);
}
console.log('insertion full-copy sort took', performance.now() - start, 'ms');

const builtinSortBodies = bodies.slice(0);
start = performance.now();
for (let i = 0; i < I; ++i) {
  builtinSortBodies.sort(_compareBoundsX);
}
console.log('builtin in-place sort took', performance.now() - start, 'ms');

const insertionSortBodies = bodies.slice(0);
start = performance.now();
for (let i = 0; i < I; ++i) {
  insertionSort(insertionSortBodies);
}
console.log('insertion in-place sort took', performance.now() - start, 'ms');

start = performance.now();
for (let i = 0; i < I; ++i) {
  const builtinSortBodies = bodies.slice(0);
  nudgeBodies(builtinSortBodies);
  builtinSortBodies.sort(_compareBoundsX);
}
console.log('nudged builtin full-copy sort took', performance.now() - start, 'ms');

start = performance.now();
for (let i = 0; i < I; ++i) {
  const insertionSortBodies = bodies.slice(0);
  nudgeBodies(insertionSortBodies);
  insertionSort(insertionSortBodies);
}
console.log('nudged insertion full-copy sort took', performance.now() - start, 'ms');

const nudgedBuiltinSortBodies = bodies.slice(0);
start = performance.now();
for (let i = 0; i < I; ++i) {
  nudgeBodies(nudgedBuiltinSortBodies);
  nudgedBuiltinSortBodies.sort(_compareBoundsX);
}
console.log('nudged builtin in-place sort took', performance.now() - start, 'ms');

const nudgedInsertionSortBodies = bodies.slice(0);
start = performance.now();
for (let i = 0; i < I; ++i) {
  nudgeBodies(nudgedInsertionSortBodies);
  insertionSort(nudgedInsertionSortBodies);
}
console.log('nudged insertion in-place sort took', performance.now() - start, 'ms');

This is not benchmarking matter in particular, because the problem I'm discussing is actually not even matter-specific.

The benchmark runs 500 iterations over 1000 "bodies". 4 cases are run:

  • Full-copy builtin sort
  • Full-copy insertion sort
  • In-place builtin sort
  • In-place insertion sort
  • "Nudged" versions of each -- every frame the bodies' positions are updated on a random stride

The results:

builtin full-copy sort took 148.09417000000002 ms
insertion full-copy sort took 721.5413179999999 ms
builtin in-place sort took 13.524974999999927 ms
insertion in-place sort took 5.559043000000088 ms
nudged builtin full-copy sort took 161.2790520000001 ms
nudged insertion full-copy sort took 1020.1278499999999 ms
nudged builtin in-place sort took 63.12692900000002 ms
nudged insertion in-place sort took 16.855706000000282 ms

Take-aways:

  • Far from "should be fast", in-place vs. full-copy sorting is an order of magnitude difference
  • In-place insertion sort is roughly 3x faster than in-place builtin sort
  • Insertion sorting is indeed slower when the entire bodies array is blown away every frame

It's trivial to show over an order of magnitude improvement; hopefully this is illuminating.

Of course, if you disagree with the benchmark methodology, feel free to suggest improvements.

Like I already noted, none of this will affect me either way since I have implemented a custom detector! I just think an order of magnitude is worth considering for the stock implementation.

@ggorlen
Copy link

ggorlen commented Jan 20, 2025

Thanks for the benchmark. You're proposing a change to Matter.js, so the specific use case does matter. I don't know the internals so I can't review it, but I'm assuming you can't get rid of the copy, and in that case, insertion sort is destroyed every time by the builtin, likely ending the debate in favor of keeping the builtin.

So I'd imagine adopting insertion sort boils down to:

  1. Showing that the copy is never necessary (unlikely, I'm assuming Liabru put this in for a good reason)
  2. Showing that insertion sort is never going to hit a quadratic edge case that causes it to blow up and create extreme lag (also highly unlikely, I imagine worst-case or average-case complexity will happen all the time with this sort--is nudgeBodies reflective of all MJS steps?)
  3. Showing that performance remains strong with higher N values--1000 is a typical MJS simulation, but I'd expect it should be able to linearly scale to say, 100k bodies as well, or whatever the upper limit might be in the current engine on a typical machine.

Assuming the copy isn't needed at all, the next step I'd take is to actually integrate the insertion sort in MJS and run a number of real-world MJS apps (e.g. create hundreds of thousands of bodies and click around for a minute) and use the browser's profiler to compare performance between the two sorts.

Again, I don't know MJS internals, just trying to move the proof along, so there's probably not much more I can offer.

@cha0s
Copy link
Author

cha0s commented Jan 22, 2025

  1. Even if the copy is necessary, I think such a massive difference in performance at least warrants investigation and possibly some refactoring to remove any tight-coupling that may exist between the integrity of the simulation and the ordering of the bodies in the array. I strongly suspect that there is no logical dependency between the two.
  2. The full-copy insertion sort with randomly-distributed bodies is exactly the worst-case quadratic scenario. It was chosen because of its worst-case property. A real-world scenario would almost certainly perform better, but no worse! There's little point IMO in contriving a "real-world" scenario due to differences in user implementation; it doesn't get worst than the worst-case.
  3. I'm sure at some point the overhead of JS will outpace the savings, so eventually the builtin sort will perform better.

Two points moving forward:

  1. Removing the copy is the real general-case performance win to pursue if general performance is important.
  2. The only concrete action I initially proposed was make a one-line API change to swap out the sort because it's easy to show cases where it would be desirable.

Moving forward, let's just forget about the insertion sort. As I noted in my original comment, the small API change leaves that to developers anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants