
support DataLevel0BlocksMemory data struct #577

Open · wants to merge 2 commits into base: develop
Conversation

@Axlgrep commented Jul 27, 2024

Support the DataLevel0BlocksMemory data structure to solve the memory-doubling problem of huge realloc calls.

  1. The current `data_level0_memory_` design uses one contiguous memory block, which can cause memory usage to double when a large index is resized. This is very unfriendly to both memory and performance: the realloc operation may trigger an expensive data copy (see the quote below). We therefore redesigned `data_level0_memory_` to use multiple small memory blocks linked together.
  2. In addition, we found some obvious `_mm_prefetch` out-of-bounds accesses and fixed them (there may be other hidden issues; please check for those as well).
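The linked-small-blocks idea in point 1 could be sketched roughly as follows. This is an illustrative sketch only; the names and layout here are my assumptions, not the PR's actual identifiers:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the small-blocks idea: level-0 element storage is
// split into fixed-size blocks, and an element id is mapped to a
// (block, offset) pair instead of a single offset into one contiguous
// allocation.
struct Level0Blocks {
    std::size_t size_data_per_element;  // bytes per element at level 0
    std::size_t elements_per_block;     // how many elements each block holds
    std::vector<std::vector<char>> blocks;

    Level0Blocks(std::size_t per_elem, std::size_t per_block)
        : size_data_per_element(per_elem), elements_per_block(per_block) {}

    // Growing appends whole blocks; existing blocks never move, so no
    // realloc-style copy of already-stored data is needed.
    void resize(std::size_t max_elements) {
        std::size_t needed =
            (max_elements + elements_per_block - 1) / elements_per_block;
        while (blocks.size() < needed)
            blocks.emplace_back(elements_per_block * size_data_per_element);
    }

    // Translate an internal id into a pointer inside the owning block.
    char *get(std::size_t internal_id) {
        return blocks[internal_id / elements_per_block].data()
               + (internal_id % elements_per_block) * size_data_per_element;
    }
};
```

The key property is in `resize`: growth only allocates new blocks, leaving previously written element data in place.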

realloc

Reallocates the given area of memory. If ptr is not NULL, it must be previously allocated by malloc, calloc or realloc and not yet freed with a call to free or realloc. Otherwise, the results are undefined.
The reallocation is done by either:

  • expanding or contracting the existing area pointed to by ptr, if possible. The contents of the area remain unchanged up to the lesser of the new and old sizes. If the area is expanded, the contents of the new part of the array are undefined.
  • allocating a new memory block of size new_size bytes, copying the memory area with size equal to the lesser of the new and the old sizes, and freeing the old block.
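As a back-of-envelope illustration of the second bullet (my summary, not code from the PR): when the allocator cannot extend the area in place, both the old and the new allocation are live during the copy, so peak usage is roughly old size plus new size. With linked fixed-size blocks, growth only adds one block on top of what is already allocated:

```cpp
#include <cstddef>

// Worst case for a contiguous buffer regrown via realloc: the old and new
// allocations coexist while data is copied between them.
std::size_t contiguous_peak(std::size_t old_size, std::size_t new_size) {
    return old_size + new_size;
}

// With linked small blocks, existing blocks stay in place; growth only
// adds one block's worth of memory on top of the current footprint.
std::size_t blocks_peak(std::size_t old_size, std::size_t block_size) {
    return old_size + block_size;
}
```

For a 64 GB index doubling to 128 GB, the contiguous design can transiently need ~192 GB, while the block design only needs the old footprint plus one small block.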

Performance

Environment

All benchmarks were run on the same Tencent instance. Details of the test setup:

  • Instance type: S5.12XLARGE128, 48 CPUs, 128 GB memory, 800 GiB SSD.
  • Kernel version: Linux 5.4.119-1-tlinux4-0008 x86_64.

Test data

Thread count: 48
We randomly generated 1 million vectors (dimension: 1024) for testing.

addPoint Performance

| M  | ef_construction | QPS  | Latency  | use_small_blocks_memory |
|----|-----------------|------|----------|-------------------------|
| 16 | 200             | 3159 | 0.316 ms | true                    |
| 16 | 200             | 3180 | 0.314 ms | false                   |
| 16 | 500             | 1429 | 0.699 ms | true                    |
| 16 | 500             | 1435 | 0.697 ms | false                   |
| 32 | 200             | 1401 | 0.713 ms | true                    |
| 32 | 200             | 1444 | 0.692 ms | false                   |
| 32 | 500             | 703  | 1.422 ms | true                    |
| 32 | 500             | 708  | 1.412 ms | false                   |

searchKnn Performance

| M  | ef_construction | topk | QPS   | Latency  | use_small_blocks_memory |
|----|-----------------|------|-------|----------|-------------------------|
| 16 | 200             | 10   | 40209 | 0.025 ms | true                    |
| 16 | 200             | 10   | 40337 | 0.025 ms | false                   |
| 16 | 500             | 10   | 39511 | 0.025 ms | true                    |
| 16 | 500             | 10   | 40564 | 0.025 ms | false                   |
| 32 | 200             | 10   | 19507 | 0.051 ms | true                    |
| 32 | 200             | 10   | 20136 | 0.051 ms | false                   |
| 32 | 500             | 10   | 19562 | 0.051 ms | true                    |
| 32 | 500             | 10   | 19545 | 0.051 ms | false                   |

@yurymalkov Please take a look at this PR, thank you.

@yurymalkov (Member)

Thank you for the update! I wonder if you've looked into whether the speed is the same before and after the code change for flat memory?
I can imagine that adding an additional `if` can affect speed. If it does, there are some possible workarounds.

@Axlgrep (Author) commented Jul 29, 2024

Ok, I'll update the test results later.

@Axlgrep (Author) commented Jul 31, 2024

Git commit: 020de1a
Tested with hnswlib from before the DataLevel0BlocksMemory support.

| M  | ef_construction | addPoint QPS | addPoint Latency | searchKnn QPS | searchKnn Latency |
|----|-----------------|--------------|------------------|---------------|-------------------|
| 16 | 200             | 4093         | 0.245 ms         | 55134         | 0.018 ms          |
| 16 | 500             | 1853         | 0.538 ms         | 53650         | 0.019 ms          |
| 32 | 200             | 1727         | 0.579 ms         | 25993         | 0.038 ms          |
| 32 | 500             | 894          | 1.122 ms         | 25914         | 0.039 ms          |

Git commit: 2a4cab5
Tested with hnswlib from after the DataLevel0BlocksMemory support, with the use_small_blocks_memory parameter set to false.

| M  | ef_construction | addPoint QPS | addPoint Latency | searchKnn QPS | searchKnn Latency |
|----|-----------------|--------------|------------------|---------------|-------------------|
| 16 | 200             | 4148         | 0.241 ms         | 55456         | 0.018 ms          |
| 16 | 500             | 1888         | 0.532 ms         | 54123         | 0.019 ms          |
| 32 | 200             | 1743         | 0.573 ms         | 26035         | 0.038 ms          |
| 32 | 500             | 894          | 1.120 ms         | 25821         | 0.039 ms          |

For accuracy, each value above is the average of three test runs.

I don't think adding an `if` check will significantly affect performance, because its cost is negligible relative to the vector computation itself.

@Axlgrep force-pushed the optimize_memory_allocation branch from 58b7344 to 2a4cab5 on July 31, 2024 12:16
@Axlgrep (Author) commented Aug 6, 2024

@yurymalkov Please take a look at the performance test data here

@yurymalkov (Member)

Thank you so much for the test!

After discussions with @dyashuni, we think a 1-1.5% performance hit should be fine for now (an alternative is to make it a template, but that might slow down compilation and install time for the pip package).

I am going to review the PR in much more detail (one thing that is already evident is a naming style different from the library's).

I also wonder if you can implement automatic allocation of a new block when a new element is added but there is no more room; I think that is something many people have asked for. It would add synchronization-lock logic and might be as simple as calling resize at https://github.dev/Axlgrep/hnswlib/blob/2a4cab5dce364587b1f9c64dcd7532c4f47f9f24/hnswlib/hnswalg.h#L1391-L1393 (I guess that would require a global lock and a counter of unfinished operations, or a read-write lock).
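The read-write-lock variant of that auto-grow idea could look roughly like this. This is a speculative sketch under my own assumptions, not the PR's code; a real implementation would also need per-element locks and lock-free fast paths to avoid serializing inserts:

```cpp
#include <cstddef>
#include <mutex>
#include <shared_mutex>

// Sketch: readers (e.g. searchKnn) take a shared lock, while an insert
// that finds no free slot takes the exclusive lock to append one more
// block, so readers never observe a half-grown structure.
class AutoGrowStorage {
    std::shared_mutex rw_;
    std::size_t capacity_ = 0;   // slots available across all blocks
    std::size_t count_ = 0;      // slots already handed out
    std::size_t block_elems_;    // elements gained per grow step
public:
    explicit AutoGrowStorage(std::size_t block_elems)
        : block_elems_(block_elems) {}

    // Reserve a slot for a new element, growing by one block when full.
    std::size_t add_slot() {
        std::unique_lock<std::shared_mutex> wl(rw_);
        if (count_ == capacity_)
            capacity_ += block_elems_;  // append one block; old blocks stay put
        return count_++;
    }

    // Readers hold the shared lock while accessing element data.
    std::size_t size() {
        std::shared_lock<std::shared_mutex> rl(rw_);
        return count_;
    }
};
```

Because the block design never moves existing data, the exclusive section only has to publish the new block, which keeps the write lock short.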
