
support DataLevel0BlocksMemory data struct #577

Open · wants to merge 2 commits into base: develop
Conversation

@Axlgrep commented Jul 27, 2024

Support the DataLevel0BlocksMemory data structure to solve the memory-doubling problem of huge realloc calls.

  1. The current `data_level0_memory_` design uses one contiguous memory block, which can cause memory usage to double when a large index is resized. This is very unfriendly to both memory and performance: the realloc operation may trigger an expensive data copy (see the quote below). We therefore redesigned `data_level0_memory_` to use multiple small memory blocks linked together.
  2. In addition, we found some obvious `_mm_prefetch` out-of-bounds accesses and fixed them (there may be other hidden issues; please check for those as well).
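The linked-small-blocks idea in point 1 could be sketched roughly as follows. This is an illustrative sketch only; the names and layout here are my assumptions, not the PR's actual identifiers:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the small-blocks idea: level-0 element storage is
// split into fixed-size blocks, and an element id is mapped to a
// (block, offset) pair instead of a single offset into one contiguous
// allocation.
struct Level0Blocks {
    std::size_t size_data_per_element;  // bytes per element at level 0
    std::size_t elements_per_block;     // how many elements each block holds
    std::vector<std::vector<char>> blocks;

    Level0Blocks(std::size_t per_elem, std::size_t per_block)
        : size_data_per_element(per_elem), elements_per_block(per_block) {}

    // Growing appends whole blocks; existing blocks never move, so no
    // realloc-style copy of already-stored data is needed.
    void resize(std::size_t max_elements) {
        std::size_t needed =
            (max_elements + elements_per_block - 1) / elements_per_block;
        while (blocks.size() < needed)
            blocks.emplace_back(elements_per_block * size_data_per_element);
    }

    // Translate an internal id into a pointer inside the owning block.
    char *get(std::size_t internal_id) {
        return blocks[internal_id / elements_per_block].data()
               + (internal_id % elements_per_block) * size_data_per_element;
    }
};
```

The key property is in `resize`: growth only allocates new blocks, leaving previously written element data in place.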

realloc

Reallocates the given area of memory. If ptr is not NULL, it must be previously allocated by malloc, calloc or realloc and not yet freed with a call to free or realloc. Otherwise, the results are undefined.
The reallocation is done by either:

  • expanding or contracting the existing area pointed to by ptr, if possible. The contents of the area remain unchanged up to the lesser of the new and old sizes. If the area is expanded, the contents of the new part of the array are undefined.
  • allocating a new memory block of size new_size bytes, copying the memory area with size equal to the lesser of the new and the old sizes, and freeing the old block.
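As a back-of-envelope illustration of the second bullet (my summary, not code from the PR): when the allocator cannot extend the area in place, both the old and the new allocation are live during the copy, so peak usage is roughly old size plus new size. With linked fixed-size blocks, growth only adds one block on top of what is already allocated:

```cpp
#include <cstddef>

// Worst case for a contiguous buffer regrown via realloc: the old and new
// allocations coexist while data is copied between them.
std::size_t contiguous_peak(std::size_t old_size, std::size_t new_size) {
    return old_size + new_size;
}

// With linked small blocks, existing blocks stay in place; growth only
// adds one block's worth of memory on top of the current footprint.
std::size_t blocks_peak(std::size_t old_size, std::size_t block_size) {
    return old_size + block_size;
}
```

For a 64 GB index doubling to 128 GB, the contiguous design can transiently need ~192 GB, while the block design only needs the old footprint plus one small block.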

Performance

Environment

All benchmarks were run on the same Tencent instance. Details of the test setup:

  • Instance type: S5.12XLARGE128, 48 CPUs, 128 GB memory, 800 GiB SSD.
  • Kernel version: Linux 5.4.119-1-tlinux4-0008 x86_64.

Test data

Thread count: 48
We randomly generated 1 million vectors (dimension: 1024) for testing.

addPoint Performance

| M  | ef_construction | QPS  | Latency  | use_small_blocks_memory |
|----|-----------------|------|----------|-------------------------|
| 16 | 200             | 3159 | 0.316 ms | true                    |
| 16 | 200             | 3180 | 0.314 ms | false                   |
| 16 | 500             | 1429 | 0.699 ms | true                    |
| 16 | 500             | 1435 | 0.697 ms | false                   |
| 32 | 200             | 1401 | 0.713 ms | true                    |
| 32 | 200             | 1444 | 0.692 ms | false                   |
| 32 | 500             | 703  | 1.422 ms | true                    |
| 32 | 500             | 708  | 1.412 ms | false                   |

searchKnn Performance

| M  | ef_construction | topk | QPS   | Latency  | use_small_blocks_memory |
|----|-----------------|------|-------|----------|-------------------------|
| 16 | 200             | 10   | 40209 | 0.025 ms | true                    |
| 16 | 200             | 10   | 40337 | 0.025 ms | false                   |
| 16 | 500             | 10   | 39511 | 0.025 ms | true                    |
| 16 | 500             | 10   | 40564 | 0.025 ms | false                   |
| 32 | 200             | 10   | 19507 | 0.051 ms | true                    |
| 32 | 200             | 10   | 20136 | 0.051 ms | false                   |
| 32 | 500             | 10   | 19562 | 0.051 ms | true                    |
| 32 | 500             | 10   | 19545 | 0.051 ms | false                   |

@yurymalkov Please take a look at this PR, thank you.

@yurymalkov (Member)

Thank you for the update! I wonder if you've looked into whether the speed is the same before and after the code change for flat memory?
I can imagine that adding an additional `if` can affect speed. If it does, there are some possible workarounds.

@Axlgrep (Author) commented Jul 29, 2024

Ok, I'll update the test results later.

@Axlgrep (Author) commented Jul 31, 2024

Git commit: 020de1a
Tested with hnswlib from before the DataLevel0BlocksMemory support.

| M  | ef_construction | addPoint QPS | addPoint Latency | searchKnn QPS | searchKnn Latency |
|----|-----------------|--------------|------------------|---------------|-------------------|
| 16 | 200             | 4093         | 0.245 ms         | 55134         | 0.018 ms          |
| 16 | 500             | 1853         | 0.538 ms         | 53650         | 0.019 ms          |
| 32 | 200             | 1727         | 0.579 ms         | 25993         | 0.038 ms          |
| 32 | 500             | 894          | 1.122 ms         | 25914         | 0.039 ms          |

Git commit: 2a4cab5
Tested with hnswlib from after the DataLevel0BlocksMemory support, with the use_small_blocks_memory parameter set to false.

| M  | ef_construction | addPoint QPS | addPoint Latency | searchKnn QPS | searchKnn Latency |
|----|-----------------|--------------|------------------|---------------|-------------------|
| 16 | 200             | 4148         | 0.241 ms         | 55456         | 0.018 ms          |
| 16 | 500             | 1888         | 0.532 ms         | 54123         | 0.019 ms          |
| 32 | 200             | 1743         | 0.573 ms         | 26035         | 0.038 ms          |
| 32 | 500             | 894          | 1.120 ms         | 25821         | 0.039 ms          |

For accuracy, each value above is the average of three test runs.

I don't think adding an `if` check will significantly affect performance, because its cost is negligible relative to the vector computation itself.

@Axlgrep force-pushed the optimize_memory_allocation branch from 58b7344 to 2a4cab5 on July 31, 2024 12:16
@Axlgrep (Author) commented Aug 6, 2024

@yurymalkov Please take a look at the performance test data here

@yurymalkov (Member)

Thank you so much for the test!

After discussions with @dyashuni, we think a 1-1.5% performance hit should be fine for now (an alternative is to make it a template, but that might slow down compilation and install time for the pip package).

I am going to review the PR in much more detail (one thing that is already evident is a naming style different from the library's).

I also wonder if you can implement automatic allocation of a new block when a new element is added but there is no more room; I think that is something many people have asked for. It would add synchronization-lock logic and might be as simple as calling resize at https://github.dev/Axlgrep/hnswlib/blob/2a4cab5dce364587b1f9c64dcd7532c4f47f9f24/hnswlib/hnswalg.h#L1391-L1393 (I guess that would require a global lock and a counter of unfinished operations, or a read-write lock).
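The read-write-lock variant of that auto-grow idea could look roughly like this. This is a speculative sketch under my own assumptions, not the PR's code; a real implementation would also need per-element locks and lock-free fast paths to avoid serializing inserts:

```cpp
#include <cstddef>
#include <mutex>
#include <shared_mutex>

// Sketch: readers (e.g. searchKnn) take a shared lock, while an insert
// that finds no free slot takes the exclusive lock to append one more
// block, so readers never observe a half-grown structure.
class AutoGrowStorage {
    std::shared_mutex rw_;
    std::size_t capacity_ = 0;   // slots available across all blocks
    std::size_t count_ = 0;      // slots already handed out
    std::size_t block_elems_;    // elements gained per grow step
public:
    explicit AutoGrowStorage(std::size_t block_elems)
        : block_elems_(block_elems) {}

    // Reserve a slot for a new element, growing by one block when full.
    std::size_t add_slot() {
        std::unique_lock<std::shared_mutex> wl(rw_);
        if (count_ == capacity_)
            capacity_ += block_elems_;  // append one block; old blocks stay put
        return count_++;
    }

    // Readers hold the shared lock while accessing element data.
    std::size_t size() {
        std::shared_lock<std::shared_mutex> rl(rw_);
        return count_;
    }
};
```

Because the block design never moves existing data, the exclusive section only has to publish the new block, which keeps the write lock short.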
