CPU limit slowing 30B, memory pool limit #154
Run it with the -t option and the number of threads you want. Yesterday I had the same issue with my 13700K, but after running it with 20 threads it's actually a lot faster.
I don't know what you mean by a CPU limit, but I have also had identical memory crashes with the error you listed. I'm running 64GB of RAM and during a session I see 90%+ memory utilization. On the CPU side, using 20 of my 24 available threads made 30B more usable. There's still delay, for sure. The memory issue did not change when I allocated more swap file space (though I didn't expect it to). I wonder if people with more base RAM can have longer sessions before the memory exit. In the spirit of the OP's question, is there any way to run these models locally on GPUs instead of the CPU?
As of right now, the only repo I know of that supports GPU is: https://github.com/tloen/alpaca-lora
Thanks, -t 20 helps with speed: it now uses 81% of the CPU and is much faster both in reaction time and in writing speed. (CPU temp is now 80 degrees, but that's normal for a Xeon and my 8-pipe radiator.) The only thing left is the memory pool problem; on the 18th prompt it crashes. 30B's answer: "You can increase your memory pool by using a larger GPU." My specs: Win 10 1903 x64, all default. ASRock X99, Intel Xeon 2.20GHz 14-core / 28-thread engineering sample from Ali, 128GB RAM (4x32GB), RTX 2070 Super, 2 x 256GB SSDs.
You can specify any number of threads. I have a 12-core CPU with 24 threads, so I'm using -t 20 (to leave 4 threads for the system and other apps). With your 28 threads available you can probably use 24 threads for Alpaca (or even 28, but I didn't try the maximum, so I'm not sure what will happen), which should increase performance (and CPU load).
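A minimal sketch of that rule of thumb (leave a few logical threads free for the OS). This is only an illustration of how one might pick a value for -t, not part of alpaca.cpp itself:

```cpp
// Illustrative only: suggest a -t value by leaving ~4 logical threads for the system.
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
    const int hw = static_cast<int>(std::thread::hardware_concurrency()); // logical threads
    const int suggested = std::max(1, hw - 4); // keep a few threads free for the OS and other apps
    std::printf("hardware threads: %d, suggested -t: %d\n", hw, suggested);
    return 0;
}
```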
I faced this problem as well, but never dug deeper into it. Not sure, but these pull requests could possibly be a fix:
Hmm. For the GPU version mentioned above, wouldn't you need one of those 80GB A100s to run the 30B model? If not, that would be incredible. The speed on my 12-core AMD is fine, though. It has a short memory, but its reasoning skills and storytelling are amazing. It doesn't hallucinate as much as the 7B and 13B models do, which is nice. It also has a keen sense of humor, complete with smiley faces, if asked a quirky question. Can't wait to see what they come up with in a year.
From the llama.cpp "Memory/Disk Requirements" section: "As the models are currently fully loaded into memory, you will need adequate disk space to save them …"
Referenced commit (…ntimatter15#154, ggerganov#294): Use F16 for memory_k and memory_v; add a command-line switch to use f16 instead of f32 for memory k+v. Co-authored-by: Ty Everett <[email protected]>
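The commit above stores the KV cache (memory_k / memory_v) in F16 instead of F32, which halves that part of the model's memory. A rough back-of-envelope sketch; the LLaMA-30B dimensions below (n_layer = 60, n_embd = 6656) and the 512-token context are my assumptions, not figures from this thread:

```cpp
// Rough size of the memory_k + memory_v cache: n_layer * n_ctx * n_embd elements each.
#include <cstdio>

int main() {
    const long long n_layer = 60;    // assumed for LLaMA-30B
    const long long n_embd  = 6656;  // assumed for LLaMA-30B
    const long long n_ctx   = 512;   // default context size
    const long long n_elem  = 2LL * n_layer * n_ctx * n_embd; // K cache + V cache
    std::printf("kv cache in f32: %.2f GiB\n", n_elem * 4.0 / (1LL << 30));
    std::printf("kv cache in f16: %.2f GiB\n", n_elem * 2.0 / (1LL << 30));
    return 0;
}
```

Halving this doesn't shrink the model weights, but it noticeably lowers overall memory pressure during long sessions.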
First of all, guys, I want to thank you for bringing such a great instrument into people's hands. Half the countries on the planet are already blocked from ChatGPT, and many people keep forgetting this.
Secondly, I advise everyone not to waste time with the 7B and 13B models; a real ChatGPT-like experience starts only with the 30B model. It can hold a discussion pattern, has a somewhat short memory of things said earlier, and if you call out its mistakes (for example, it can't always determine the current time) it can turn it all into a joke (the 13B model can't do any of this).
I have to say it's incredibly optimized: for comparison, I wasn't able to run the 1.5-billion-parameter GPT-2 model even on a GPU, only the 774-million one. In my opinion, 13B produces roughly the same amount of gibberish as the 774M GPT-2.
Now about the problems. There is certainly a CPU limit, maybe intended for low-end hardware (because at higher speed its RAM usage would also grow faster, with 30B growing to 24-25GB)? On the 13B model it used 17% of the CPU, and on the 30B model it still uses only 17% CPU at most. This limit seriously ruins the whole experience with 30B, making it 2x slower than 13B in response time and even in word-writing speed (it types like some ancient IBM machine). For powerful hardware the limit should be removable; I have plenty of resources, with 128GB RAM in quad-channel mode and a 14-core Xeon (on my machine, 30B together with Google Chrome uses 20% of RAM in total).
But I don't see any way to remove the CPU limit; your files are pure machine code.
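Most likely there is no deliberate limit; the behaviour comes from the default thread count. In llama.cpp-derived code of that period the default was roughly the following (a paraphrased sketch, not an exact quote of the alpaca.cpp source), which on a 28-thread Xeon works out to 4/28 ≈ 14% CPU, close to the 17% reported above; that is why passing -t explicitly removes the apparent limit:

```cpp
// Paraphrased sketch of the default: without -t, only ~4 threads are used.
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
    const int n_threads = std::min(4, static_cast<int>(std::thread::hardware_concurrency()));
    std::printf("default thread count when -t is not given: %d\n", n_threads);
    return 0;
}
```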
Also, there is some memory limit that makes 30B crash after a certain amount of work; it always abruptly ends the discussion, around the 5th-7th prompt, with this message:
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 537269808, available 536870912)
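For context, that message comes from ggml's fixed-size context pool: every tensor is carved out of one buffer that is sized up front, and once the running offset plus the new tensor would exceed it, allocation fails. Below is a minimal, self-contained sketch of such a bump allocator; the struct and function names are illustrative, not the actual ggml internals (512 MiB here matches the 536870912 "available" bytes in the error):

```cpp
// Minimal bump-allocator sketch of a fixed-size memory pool (illustrative names only).
#include <cstdio>
#include <cstdlib>

struct pool {
    char * base;   // one big buffer reserved up front
    size_t size;   // total bytes ("available" in the error message)
    size_t used;   // bytes handed out so far
};

void * pool_alloc(pool & p, size_t n) {
    const size_t needed = p.used + n;   // "needed" in the error message
    if (needed > p.size) {
        std::fprintf(stderr,
            "not enough space in the context's memory pool (needed %zu, available %zu)\n",
            needed, p.size);
        return nullptr;
    }
    void * ptr = p.base + p.used;
    p.used = needed;
    return ptr;
}

int main() {
    const size_t pool_size = 512u << 20; // 512 MiB == 536870912 bytes, as in the error
    pool p { static_cast<char *>(std::malloc(pool_size)), pool_size, 0 };
    // Each prompt/evaluation allocates more tensors from the pool until it runs out.
    while (pool_alloc(p, 64u << 20) != nullptr) { }
    std::free(p.base);
    return 0;
}
```

Because the pool size is fixed when the context is created, the crash hits at a fairly predictable point in a long session regardless of how much system RAM is still free, which matches the reports above.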