
CPU limit slowing 30B, memory pool limit #154

Closed
clover1980 opened this issue Mar 25, 2023 · 7 comments

Comments

@clover1980

First of all, guys, I want to thank you for putting such a great instrument into people's hands. Half the countries on the planet are already blocked from ChatGPT; many people still forget this.
Secondly, I advise everyone not to waste time on the 7B and 13B models. The real ChatGPT experience starts only with the 30B model: it can hold a conversational pattern, has some short-term memory of things said earlier, and if you call out its mistakes (for example, it can't always determine the current time) it can turn it all into a joke (the 13B model can do none of this).
I have to say it's incredibly well optimized. For comparison, I wasn't able to run the 1.5-billion-parameter GPT-2 model even on a GPU, only the 774M one. In my opinion, 13B produces roughly as much gibberish as 774M GPT-2.

Now about the problems. There certainly seems to be a CPU limit, maybe intended for low-end hardware (since at higher speed its RAM use also grows faster; 30B grows to 24-25 GB). On the 13B model it used 17% of the CPU, and on the 30B model it still uses only 17% CPU at most. This limit seriously ruins the whole 30B experience, making it twice as slow as 13B both in response time and in the speed at which it writes out words (it types like some ancient IBM machine). For powerful hardware the limit should be removable; I have plenty of resources, with 128 GB RAM in quad-channel mode and a 14-core Xeon (on my machine 30B plus Google Chrome uses 20% of RAM in total).
But I don't see any way to remove the CPU limit; the distributed files are pure machine code.
There is also some memory limit that makes 30B crash after a certain amount of work. It always ends the discussion abruptly, around the 5th-7th prompt, with this message:
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 537269808, available 536870912)
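The numbers in that error are telling: the pool that overflows is a fixed-size scratch buffer of exactly 512 MiB, and the failing allocation overshoots it by well under 1 MiB, so the crash is a buffer filling up as conversation context accumulates, not system RAM running out. A quick sanity check of the figures from the message:

```python
# Figures copied verbatim from the error message above.
needed = 537_269_808
available = 536_870_912

# The pool is exactly 512 MiB, i.e. a fixed-size buffer, not "free RAM".
assert available == 512 * 1024 ** 2

# The overshoot is only a few hundred KiB; context growth tipped it over.
shortfall = needed - available
print(f"pool = {available / 2**20:.0f} MiB, shortfall = {shortfall / 2**10:.0f} KiB")
```

That is why adding more RAM or swap does not help: the limit is compiled into the program, not imposed by the machine.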

@Path-Seeker

Run it with the -t option and the number of threads you want. Yesterday I had the same issue with my 13700K, but after running it with 20 threads it's actually a lot faster.

@Terristen

I don't know what you mean by a CPU limit, but I have also had identical memory crashes with the error you listed. I'm running 64 GB of RAM, and during a session I'm seeing 90%+ memory utilization. On the CPU side, using 20 of my 24 available threads made 30B more usable. There's still delay, for sure.

The memory issue did not change when I tried allocating more swap file space. (Though I didn't expect it to.)

I wonder if people with more base RAM can have longer sessions before the memory exit.

In the spirit of the OP question, is there any way to run these models on GPUs locally instead of the CPU?

@Patjwmiller

> In the spirit of the OP question, is there any way to run these models on GPUs locally instead of the CPU?

As of right now, the only repo I know of that currently supports GPU is: https://github.com/tloen/alpaca-lora

@clover1980
Author

clover1980 commented Mar 25, 2023

> Run it with option -t and amount of threads you want. Yesterday I had the same issue with my 13700K, but after running it with 20 threads, it's actually a lot faster

Thanks, -t 20 helps with speed; it now uses 81% of the CPU and is much faster both in reacting and in writing. (CPU temp is now 80 degrees, but that's normal for a Xeon and my 8-pipe radiator.)
30B is gold: it mocks OpenAI by telling me to contact their support about my problems, and gives quite interesting info about Microsoft's XiaoIce :)

The only thing left is the memory pool problem; on the 18th prompt it crashes with
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 536905968, available 536870912)

30B answer: You can increase your memory pool by using a larger GPU.

My specs: Win 10 1903 x64, all defaults. ASRock X99; Intel Xeon 2.20 GHz, 14 cores / 28 threads, engineering sample from AliExpress; 128 GB RAM (4 × 32 GB) in quad channel; RTX 2070 Super; 2 × 256 GB SSDs.

@Path-Seeker

Path-Seeker commented Mar 25, 2023

> Thanks, -t 20 helps with speed, it uses now 81% of CPU and much faster in reaction and writing. (CPU temp now 80 degrees but it's normal for Xeon and my 8 pipes radiator) 30B is a gold, it's mocking OpenAi telling me to contact their support for problems and quite interesting info about Microsoft's XiaoIce :)

You can specify any number of threads. I have a 12-core CPU with 24 threads, so I'm using -t 20 (to leave 4 threads for the system and other apps). With your 28 available threads you can probably use 24 for Alpaca (or even 28, but I didn't try the maximum number of threads, so I'm not sure what will happen), which should increase performance (and CPU load).
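The rule of thumb above (total logical CPUs minus a few for the OS) can be scripted. A minimal sketch, assuming a Linux-like shell with `nproc` available and that the binary is invoked as `./chat` (the usual alpaca.cpp executable name; adjust for your build):

```shell
#!/bin/sh
# Count logical CPUs, then reserve 4 threads for the OS and other apps.
TOTAL=$(nproc)
if [ "$TOTAL" -gt 4 ]; then
  THREADS=$((TOTAL - 4))
else
  THREADS=$TOTAL
fi
echo "suggested invocation: ./chat -t $THREADS"
```

Without an explicit `-t`, early alpaca.cpp builds defaulted to a small thread count, which is why a 28-thread Xeon sat at 17% CPU; this just automates picking a better value.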

> The only left is memory pool problem, on 18th prompt it crashing ggml_new_tensor_impl: not enough space in the context's memory pool (needed 536905968, available 536870912)

I faced this problem as well but never dug deeper into it. I'm not sure, but these pull requests could possibly be a fix:
#142
#126

@jeffwadsworth

Hmm. For the GPU version mentioned above, wouldn't you have to have one of those 80 GB A100s to run the 30B model? If not, that would be incredible. The speed on my 12-core AMD is fine, though. It has a short memory, but its reasoning skills and storytelling are amazing. It doesn't hallucinate as much as the 7B and 13B models do, which is nice. It also has a keen sense of humor with the smiley faces if asked a quirky question. Can't wait to see what they come up with in a year.

@a904guy

a904guy commented Mar 27, 2023

From llama.cpp

Memory/Disk Requirements

As the models are currently fully loaded into memory, you will need adequate disk space to save them
and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

| model | original size | quantized size (4-bit) |
| ----- | ------------- | ---------------------- |
| 7B    | 13 GB         | 3.9 GB                 |
| 13B   | 24 GB         | 7.8 GB                 |
| 30B   | 60 GB         | 19.5 GB                |
| 65B   | 120 GB        | 38.5 GB                |
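The ratio in that table is consistent with a simple back-of-envelope rule: the original weights are 16-bit floats, while ggml's 4-bit quantization stores roughly 5 bits per weight once the per-block scale factors are counted, i.e. about 5/16 of the original size. A rough cross-check against the table (the 5-bits-per-weight figure is an approximation; exact block overhead varies by quantization format):

```python
# (original GB, quantized GB) from the llama.cpp table above.
table = {"7B": (13, 3.9), "13B": (24, 7.8), "30B": (60, 19.5), "65B": (120, 38.5)}

for model, (orig_gb, quant_gb) in table.items():
    # 4-bit weights + per-block scales ~ 5 bits, vs 16-bit floats.
    estimate = orig_gb * 5 / 16
    print(f"{model}: table says {quant_gb} GB, rule of thumb gives {estimate:.1f} GB")
```

Every row lands within about 1 GB of the published figure, which is close enough for planning how much RAM and disk a given model needs.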

ItsPi3141 pushed a commit to ItsPi3141/alpaca.cpp that referenced this issue Apr 3, 2023
…ntimatter15#154) (ggerganov#294)

* Use F16 for memory_k and memory_v

* add command line switch to use f16 instead of f32 for memory k+v

---------

Co-authored-by: Ty Everett <[email protected]>
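The referenced commit bears directly on the memory pool crashes above: memory_k and memory_v are the attention key/value cache, whose footprint is roughly 2 × n_layer × n_ctx × n_embd × bytes-per-element, so switching the elements from f32 to f16 halves it. A sketch of the arithmetic, assuming the published LLaMA-30B shape (n_layer = 60, n_embd = 6656) and the default 512-token context — these dimensions come from the LLaMA paper, not from this repo's code:

```python
def kv_cache_bytes(n_layer, n_ctx, n_embd, bytes_per_elem):
    # K and V each hold n_ctx vectors of length n_embd in every layer.
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem

n_layer, n_ctx, n_embd = 60, 512, 6656  # assumed LLaMA-30B, default context

f32 = kv_cache_bytes(n_layer, n_ctx, n_embd, 4)
f16 = kv_cache_bytes(n_layer, n_ctx, n_embd, 2)
print(f"f32 KV cache ~ {f32 / 2**30:.2f} GiB, f16 ~ {f16 / 2**30:.2f} GiB")
```

With the cache consuming on the order of a gigabyte and a half at f32 for a 30B model, halving it leaves considerably more headroom in the fixed-size context pool before the "not enough space" error hits.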