
Convert: mixed k-quant with legacy quant fallback #447

Open · wants to merge 1 commit into master
Conversation

@stduhpf (Contributor) commented Oct 25, 2024

Adds a new CLI argument: --fallback-type.

If a tensor cannot be quantized to a k-quant because its shape doesn't fit the k-quant block size, the fallback type is used instead of keeping the tensor at full precision.

This is very useful for SD3.5 models, because about 90% of the SD3.5 8B weights can't be quantized to k-quants.

--type q4_k --fallback-type q4_0 always produces exactly the same output size as --type q4_0 alone, but with less quality degradation.

Somewhat addresses #446
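The size claim checks out arithmetically: both formats store 4.5 bits per weight (a Q4_0 block packs 32 weights into 18 bytes; a Q4_K super-block packs 256 weights into 144 bytes), so swapping one for the other never changes the file size. A minimal sketch, with the block constants restated from ggml's format definitions and illustrative function names (not actual stable-diffusion.cpp code):

```cpp
#include <cassert>
#include <cstdint>

// Per-block sizes of the two formats (values restated from ggml's
// quant definitions for illustration).
constexpr int64_t QK4_0      = 32;   // weights per Q4_0 block
constexpr int64_t Q4_0_BYTES = 18;   // fp16 scale (2 B) + 32 x 4-bit quants (16 B)
constexpr int64_t QK_K       = 256;  // weights per k-quant super-block
constexpr int64_t Q4_K_BYTES = 144;  // d+dmin (4 B) + 6-bit scales (12 B) + 256 x 4-bit quants (128 B)

// Bytes needed to store n weights in each format
// (n must be divisible by the block size).
int64_t q4_0_size(int64_t n) { return n / QK4_0 * Q4_0_BYTES; }
int64_t q4_k_size(int64_t n) { return n / QK_K  * Q4_K_BYTES; }
```

Both come out to 18*8/32 = 144*8/256 = 4.5 bits per weight, so for any tensor whose size fits both block sizes the byte counts are identical.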

@stduhpf (Contributor, PR author) commented Oct 25, 2024

I'm currently uploading the quantized weights to HF, but over my cellular connection it's taking a very long time.

here:

@thxCode commented Nov 29, 2024

Sorry, I got lost in this PR. So the conclusion is that a mixed (Q4_K, Q4_0) quantization is better than Q4_1?

Do we have something like [perplexity](https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md) numbers?

The quantized files also confuse me: what does q4_k_4_0 mean? Is it a new kind of GGML type?

Can we mostly keep following the file type definition? https://github.com/ggerganov/ggml/blob/d8ea053461056a5c15f071c7c5ed57d86e892750/include/ggml.h#L408-L436

@stduhpf (Contributor, PR author) commented Nov 29, 2024

> and the quantized files also confuse me, what does q4_k_4_0 mean? is it a new kind of GGML type?

It's not exactly a new GGML type; it's simply a file type with different weight types mixed in it, kind of like Q4_K_L, Q3_K_M, and such.

Q4_K is the main quantization type used, and Q4_0 is the fallback. The only rule deciding whether a tensor uses the main type or the fallback is whether the tensor's shape fits a whole number of k-quant super-blocks or not.

In the case of SD3.5 Large, the resulting file ends up with more Q4_0 tensors than Q4_K ones.
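The per-tensor rule described here is simple enough to sketch. Assuming hypothetical names (this is not the actual stable-diffusion.cpp code), the decision looks roughly like:

```cpp
#include <cstdint>

// Illustrative sketch of the PR's per-tensor type selection
// (names are hypothetical, not taken from stable-diffusion.cpp).
enum class QType { Q4_K, Q4_0, F16 };

constexpr int64_t QK_K = 256;  // k-quant super-block size in ggml

// A tensor keeps the requested k-quant only if its row length is a
// whole multiple of the 256-weight super-block; otherwise it gets the
// --fallback-type (previously it would have stayed at full precision).
QType choose_type(int64_t row_size, QType main_type, QType fallback) {
    return (row_size % QK_K == 0) ? main_type : fallback;
}
```

With main type Q4_K and fallback Q4_0, a file like the one discussed here comes out mostly Q4_0 whenever most tensor rows fail the % 256 check, matching the SD3.5 Large observation above.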

> can we keep following mostly file type definition? https://github.com/ggerganov/ggml/blob/d8ea053461056a5c15f071c7c5ed57d86e892750/include/ggml.h#L408-L436

Even llama.cpp doesn't keep following this definition: https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L18-L59

@thxCode commented Nov 29, 2024

The --fallback-type flag is difficult to understand; it adds yet another option with many possible values. Is it possible to do the fallback internally instead? For example, when --type q4_k_l is given, apply q4_k to most tensors and fall back to q4_0 for the mismatched ones.
