Performance: detector body sorting #1329

cha0s · 2025-01-11T05:21:22Z

Line 72 in f9208df

bodies.sort(Detector._compareBoundsX);

The above code is largely inefficient most of the time in practice. This is because quicksort is generally less efficient on mostly-sorted arrays.

Funny as it may sound, the humble insertion sort is a great candidate for sorting mostly-sorted arrays. This is the case a lot of the time in practice: most objects do not move so far every frame as to essentially randomize the bounds sorting.

Replacing the code above with

  for (let i = 1; i < bodies.length; i++) {
    let current = bodies[i];
    let j = i - 1;
    while ((j > -1) && (current.bounds.min.x < bodies[j].bounds.min.x)) {
      bodies[j + 1] = bodies[j];
      j--;
    }
    bodies[j + 1] = current;
  }

runs almost 25% faster.

You may not agree with this and I'm curious what your opinion is.

Regardless of whether you think this sorting strategy makes sense in a general case, I suggest that the original line of code above is moved to a new method on Detector: sortBodies. If this was done, it would be much easier to swap out a sort that makes sense for your simulation by simply overriding Detector.sortBodies. As it stands today, the entire Detector.collisions method most be overridden to change the sorting strategy for bodies.

Thanks for your work!

The text was updated successfully, but these errors were encountered:

ggorlen · 2025-01-19T20:15:58Z

The builtin v8 sort doesn't use quicksort any longer, and when it did, it used insertion sort on arrays of less than ~20 elements, which was optimal at the time. I'm pretty sure most implementations now use timsort.

runs almost 25% faster.

Please share your benchmarking code you used to come to this conclusion.

You may not agree with this and I'm curious what your opinion is.

It shouldn't be a question of opinion, just a matter of proper benchmarking. That said, the chances are that a custom sort implementation will outperform the native v8 is extremely low, I would imagine.

cha0s · 2025-01-19T23:02:40Z

My usecase is a topdown game where a body moves through space and other objects e.g. walk around "relatively" slowly. The temporal coherence in this case is quite high. In other words, the y-order of AABBs is relatively stable frame-to-frame vs. simulating e.g. gas particles bouncing around inside of a volume.

I don't have an isolated benchmark available. A simulation with hundreds of relatively-stable y-ordered bounding boxes is clearly more efficiently sorted by y-order with e.g. insertion sort. I'm not talking about < 20 elements. I assure you that I measured it for my own simulation! It's trivial to outperform the built-in array sort.

I agree that opinion doesn't really factor in. The only relevant opinion is whether the library itself should be optimized for more high-coherence use cases. Personally I suspect that this is, on average, the majority of use cases. I proposed making it easier to simply swap out the sort algorithm without having to write the entire Detector.collisions method for exactly that reason: opinion is irrelevant!

P.S. I don't even sort the bodies in my current implementation. I have written a custom Detector that hashes the world spatially to determine the eligibility of body collision. This blows away the built-in implementation performance which not only does no hashing whatsoever (e.g. will compare objects hundreds or thousands of units away on the x axis), but also copies the entire bodies array (see:

matter-js/src/collision/Detector.js

Line 39 in f9208df

detector.bodies = bodies.slice(0);

) because .sort operates in-place and care is taken to not mutate it directly (though I wonder whether that care is strictly necessary).

Correction: The broadphase is actually done on the X axis, not the Y axis. The points remain the same though!

cha0s · 2025-01-19T23:25:35Z

I looked into the sorting algorithm and you're right as far as I can tell: v8 uses an algorithm called timsort which is actually quite fast on mostly-sorted data.

I think the actual reason it's possible to beat the built-in implementation easily isn't because of the sorting algorithm at all, it's more what I mentioned in my previous comment: every world update completely smashes the previously-sorted array by copying the bodies array.

If that copy never took place and the bodies were simply sorted in-place continuously, the performance would probably increase greatly.

ggorlen · 2025-01-19T23:35:48Z

It's trivial to outperform the built-in array sort.

The burden of proof is on you here, since the builtin v8 array sort is highly optmized by teams of uber-experts in algorithms, and written in low-level C++ (Torque?) code to be as fast as possible.

As you point out, Timsort is optimized on mostly-sorted arrays, but even if it wasn't, heavily-optimized insertion sort has been used in quicksort implementations for the leaf recursive nodes since time immemorial. There's basically never been a time in recent history where quicksort was used in totally pure form in any JS engine (guessing, but seems like a safe bet).

Similarly, bodies.slice(0); should be highly optimized and extremely fast, even with many thousands of elements (and most MJS simulations don't have hundreds of thousands or millions of bodies). Sure, if you can avoid a copy or a sort on every frame, that's great, but without looking into it, I assume this isn't a superfluous copy.

I don't know much about the internals of MJS, so I'll defer to those that do, but for any optimization proposals, a runnable, reproducible benchmark is pretty much standard practice to provide to kick off the conversation.

cha0s · 2025-01-19T23:57:37Z

You're not seeing the whole picture. In isolation, sort and copy are optimized, but every time you are copy that array you are blowing away the previous sort. This is why it is trivial to outperform the built-in implementation.

It's fine if you don't want to take my word for it. I can just hack the library to fix the performance bottlenecks (which is what I'm doing)! Open source is about sharing what we find instead of keeping it to ourselves, but I'm certainly not interested in forcing the issue.

FYI, copy doesn't necessarily happen on every frame. It happens when the bodies configuration of the world changes. This, of course, may indeed be every frame based on the usage of the engine. As an aside, it is almost unbelievable that the idea of copying the entire world bodies array every frame would be handwaved away as "should be fast". This seems to imply a very fundamental difference between the kind of performance targets the two of us pursue.

cha0s · 2025-01-20T01:20:19Z

This topic is just too silly to ignore :)

I deleted my last comment since I had a bug in the insertion sort.

Here's a benchmark:

const I = 500;
const N = 1000;

function createRandomBodies(N) {
  const bodies = [];
  for (let i = 0; i < N; ++i) {
    bodies.push({
      bounds: {
        max: {x: Math.random() * 500, y: Math.random() * 500},
        min: {x: Math.random() * 500, y: Math.random() * 500},
      },
    });
  }
  return bodies;
}

function nudgeBodies(bodies) {
  const stride = Math.floor(Math.random() * 10) + 1;
  for (let i = 0; i < N; i += stride) {
    const xn = (Math.random() * 50) - 100;
    const yn = (Math.random() * 50) - 100;
    bodies[i].bounds.max.x += xn;
    bodies[i].bounds.min.x += xn;
    bodies[i].bounds.max.y += yn;
    bodies[i].bounds.min.y += yn;
  }
}

const _compareBoundsX = function(bodyA, bodyB) {
  return bodyA.bounds.min.x - bodyB.bounds.min.x;
};


function insertionSort(array) {
  for (let i = 1; i < array.length; i++) {
    let current = array[i];
    let j = i - 1;
    while ((j > -1) && (current.bounds.min.x < array[j].bounds.min.x)) {
      array[j + 1] = array[j];
      j--;
    }
    array[j + 1] = current;
  }
}

const bodies = createRandomBodies(N);

let start;

start = performance.now();
for (let i = 0; i < I; ++i) {
  const builtinSortBodies = bodies.slice(0);
  builtinSortBodies.sort(_compareBoundsX);
}
console.log('builtin full-copy sort took', performance.now() - start, 'ms');

start = performance.now();
for (let i = 0; i < I; ++i) {
  const insertionSortBodies = bodies.slice(0);
  insertionSort(insertionSortBodies);
}
console.log('insertion full-copy sort took', performance.now() - start, 'ms');

const builtinSortBodies = bodies.slice(0);
start = performance.now();
for (let i = 0; i < I; ++i) {
  builtinSortBodies.sort(_compareBoundsX);
}
console.log('builtin in-place sort took', performance.now() - start, 'ms');

const insertionSortBodies = bodies.slice(0);
start = performance.now();
for (let i = 0; i < I; ++i) {
  insertionSort(insertionSortBodies);
}
console.log('insertion in-place sort took', performance.now() - start, 'ms');

start = performance.now();
for (let i = 0; i < I; ++i) {
  const builtinSortBodies = bodies.slice(0);
  nudgeBodies(builtinSortBodies);
  builtinSortBodies.sort(_compareBoundsX);
}
console.log('nudged builtin full-copy sort took', performance.now() - start, 'ms');

start = performance.now();
for (let i = 0; i < I; ++i) {
  const insertionSortBodies = bodies.slice(0);
  nudgeBodies(insertionSortBodies);
  insertionSort(insertionSortBodies);
}
console.log('nudged insertion full-copy sort took', performance.now() - start, 'ms');

const nudgedBuiltinSortBodies = bodies.slice(0);
start = performance.now();
for (let i = 0; i < I; ++i) {
  nudgeBodies(nudgedBuiltinSortBodies);
  nudgedBuiltinSortBodies.sort(_compareBoundsX);
}
console.log('nudged builtin in-place sort took', performance.now() - start, 'ms');

const nudgedInsertionSortBodies = bodies.slice(0);
start = performance.now();
for (let i = 0; i < I; ++i) {
  nudgeBodies(nudgedInsertionSortBodies);
  insertionSort(nudgedInsertionSortBodies);
}
console.log('nudged insertion in-place sort took', performance.now() - start, 'ms');

This is not benchmarking matter in particular, because the problem I'm discussing is actually not even matter-specific.

The benchmark runs 500 iterations over 1000 "bodies". 4 cases are run:

Full-copy builtin sort
Full-copy insertion sort
In-place builtin sort
In-place insertion sort
"Nudged" versions of each -- every frame the bodies' positions are updated on a random stride

The results:

builtin full-copy sort took 148.09417000000002 ms
insertion full-copy sort took 721.5413179999999 ms
builtin in-place sort took 13.524974999999927 ms
insertion in-place sort took 5.559043000000088 ms
nudged builtin full-copy sort took 161.2790520000001 ms
nudged insertion full-copy sort took 1020.1278499999999 ms
nudged builtin in-place sort took 63.12692900000002 ms
nudged insertion in-place sort took 16.855706000000282 ms

Take-aways:

Far from "should be fast", in-place vs. full-copy sorting is an order of magnitude difference
In-place insertion sort is roughly 3x faster than in-place builtin sort
Insertion sorting is indeed slower when the entire bodies array is blown away every frame

It's trivial to show over an order of magnitude improvement; hopefully this is illuminating.

Of course, if you disagree with the benchmark methodology, feel free to suggest improvements.

Like I already noted, none of this will affect me either way since I have implemented a custom detector! I just think an order of magnitude is worth considering for the stock implementation.

ggorlen · 2025-01-20T16:46:07Z

Thanks for the benchmark. You're proposing a change to Matter.js, so the specific use case does matter. I don't know the internals so I can't review it, but I'm assuming you can't get rid of the copy, and in that case, insertion sort is destroyed every time by the builtin, likely ending the debate in favor of keeping the builtin.

So I'd imagine adopting insertion sort boils down to:

Showing that the copy is never necessary (unlikely, I'm assuming Liabru put this in for a good reason)
Showing that insertion sort is never going to hit a quadratic edge case that causes it to blow up and create extreme lag (also highly unlikely, I imagine worst-case or average-case complexity will happen all the time with this sort--is nudgeBodies reflective of all MJS steps?)
Showing that performance remains strong with higher N values--1000 is a typical MJS simulation, but I'd expect it should be able to linearly scale to say, 100k bodies as well, or whatever the upper limit might be in the current engine on a typical machine.

Assuming the copy isn't needed at all, the next step I'd take is to actually integrate the insertion sort in MJS and run a number of real-world MJS apps (e.g. create hundreds of thousands of bodies and click around for a minute) and use the browser's profiler to compare performance between the two sorts.

Again, I don't know MJS internals, just trying to move the proof along, so there's probably not much more I can offer.

cha0s · 2025-01-22T03:21:18Z

Even if the copy is necessary, I think such a massive difference in performance at least warrants investigation and possibly some refactoring to remove any tight-coupling that may exist between the integrity of the simulation and the ordering of the bodies in the array. I strongly suspect that there is no logical dependency between the two.
The full-copy insertion sort with randomly-distributed bodies is exactly the worst-case quadratic scenario. It was chosen because of its worst-case property. A real-world scenario would almost certainly perform better, but no worse! There's little point IMO in contriving a "real-world" scenario due to differences in user implementation; it doesn't get worst than the worst-case.
I'm sure at some point the overhead of JS will outpace the savings, so eventually the builtin sort will perform better.

Two points moving forward:

Removing the copy is the real general-case performance win to pursue if general performance is important.
The only concrete action I initially proposed was make a one-line API change to swap out the sort because it's easy to show cases where it would be desirable.

Moving forward, let's just forget about the insertion sort. As I noted in my original comment, the small API change leaves that to developers anyway.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance: detector body sorting #1329

Performance: detector body sorting #1329

cha0s commented Jan 11, 2025 •

edited

Loading

ggorlen commented Jan 19, 2025 •

edited

Loading

cha0s commented Jan 19, 2025 •

edited

Loading

cha0s commented Jan 19, 2025

ggorlen commented Jan 19, 2025 •

edited

Loading

cha0s commented Jan 19, 2025

cha0s commented Jan 20, 2025 •

edited

Loading

ggorlen commented Jan 20, 2025 •

edited

Loading

cha0s commented Jan 22, 2025

Performance: detector body sorting #1329

Performance: detector body sorting #1329

Comments

cha0s commented Jan 11, 2025 • edited Loading

ggorlen commented Jan 19, 2025 • edited Loading

cha0s commented Jan 19, 2025 • edited Loading

cha0s commented Jan 19, 2025

ggorlen commented Jan 19, 2025 • edited Loading

cha0s commented Jan 19, 2025

cha0s commented Jan 20, 2025 • edited Loading

ggorlen commented Jan 20, 2025 • edited Loading

cha0s commented Jan 22, 2025

cha0s commented Jan 11, 2025 •

edited

Loading

ggorlen commented Jan 19, 2025 •

edited

Loading

cha0s commented Jan 19, 2025 •

edited

Loading

ggorlen commented Jan 19, 2025 •

edited

Loading

cha0s commented Jan 20, 2025 •

edited

Loading

ggorlen commented Jan 20, 2025 •

edited

Loading