
Convert: mixed k-quant with legacy quant fallback #447

Open · wants to merge 1 commit into master
Conversation

@stduhpf (Contributor) commented Oct 25, 2024

Adds a new CLI argument: --fallback-type.

If a tensor cannot be quantized to a k-quant because its shape doesn't fit the k-quant block size, the fallback type is used instead of keeping the tensor at full precision.

This is very useful for SD3.5 models, because about 90% of the SD3.5 8B weights can't be quantized to k-quants.

--type q4_k --fallback-type q4_0 always produces exactly the same output size as --type q4_0 alone, but with less quality degradation.

Somewhat addresses #446
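The size claim checks out arithmetically: both formats store 4.5 bits per weight (a Q4_0 block packs 32 weights into 18 bytes; a Q4_K super-block packs 256 weights into 144 bytes), so swapping one for the other never changes the file size. A minimal sketch, with the block constants restated from ggml's format definitions and illustrative function names (not actual stable-diffusion.cpp code):

```cpp
#include <cassert>
#include <cstdint>

// Per-block sizes of the two formats (values restated from ggml's
// quant definitions for illustration).
constexpr int64_t QK4_0      = 32;   // weights per Q4_0 block
constexpr int64_t Q4_0_BYTES = 18;   // fp16 scale (2 B) + 32 x 4-bit quants (16 B)
constexpr int64_t QK_K       = 256;  // weights per k-quant super-block
constexpr int64_t Q4_K_BYTES = 144;  // d+dmin (4 B) + 6-bit scales (12 B) + 256 x 4-bit quants (128 B)

// Bytes needed to store n weights in each format
// (n must be divisible by the block size).
int64_t q4_0_size(int64_t n) { return n / QK4_0 * Q4_0_BYTES; }
int64_t q4_k_size(int64_t n) { return n / QK_K  * Q4_K_BYTES; }
```

Both come out to 18*8/32 = 144*8/256 = 4.5 bits per weight, so for any tensor whose size fits both block sizes the byte counts are identical.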

@stduhpf (Contributor, PR author) commented Oct 25, 2024

I'm currently uploading the quantized weights to HF, but over my cellular connection it's taking a very long time.

here:

@thxCode commented Nov 29, 2024

Sorry, I got lost in this PR. So the conclusion is that a mixed (Q4_K, Q4_0) quantization is better than Q4_1?

Do we have something like [perplexity](https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md) numbers?

The quantized files also confuse me: what does q4_k_4_0 mean? Is it a new kind of GGML type?

Can we mostly keep following the file type definition? https://github.com/ggerganov/ggml/blob/d8ea053461056a5c15f071c7c5ed57d86e892750/include/ggml.h#L408-L436

@stduhpf (Contributor, PR author) commented Nov 29, 2024

> and the quantized files also confuse me, what does q4_k_4_0 mean? is it a new kind of GGML type?

It's not exactly a new GGML type; it's simply a file type with different weight types mixed in it, kind of like Q4_K_L, Q3_K_M, and such.

Q4_K is the main quantization type used, and Q4_0 is the fallback. The only rule deciding whether a tensor uses the main type or the fallback is whether the tensor's shape fits a whole number of k-quant super-blocks or not.

In the case of SD3.5 Large, the resulting file ends up with more Q4_0 tensors than Q4_K ones.
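The per-tensor rule described here is simple enough to sketch. Assuming hypothetical names (this is not the actual stable-diffusion.cpp code), the decision looks roughly like:

```cpp
#include <cstdint>

// Illustrative sketch of the PR's per-tensor type selection
// (names are hypothetical, not taken from stable-diffusion.cpp).
enum class QType { Q4_K, Q4_0, F16 };

constexpr int64_t QK_K = 256;  // k-quant super-block size in ggml

// A tensor keeps the requested k-quant only if its row length is a
// whole multiple of the 256-weight super-block; otherwise it gets the
// --fallback-type (previously it would have stayed at full precision).
QType choose_type(int64_t row_size, QType main_type, QType fallback) {
    return (row_size % QK_K == 0) ? main_type : fallback;
}
```

With main type Q4_K and fallback Q4_0, a file like the one discussed here comes out mostly Q4_0 whenever most tensor rows fail the % 256 check, matching the SD3.5 Large observation above.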

> can we keep following mostly file type definition? https://github.com/ggerganov/ggml/blob/d8ea053461056a5c15f071c7c5ed57d86e892750/include/ggml.h#L408-L436

Even llama.cpp doesn't keep following this definition: https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L18-L59

@thxCode commented Nov 29, 2024

The --fallback-type flag is difficult to understand; it adds yet another option with many possible values. Is it possible to do the fallback internally instead? For example, when --type q4_k_l is given, apply q4_k to most tensors and fall back to q4_0 for the mismatched ones.
