
[BUG] Llama2 and some Mistral based ggufs seem to not work. #96

Closed
docfail opened this issue Jan 30, 2025 · 12 comments · Fixed by #104
Labels
bug Something isn't working

Comments

@docfail

docfail commented Jan 30, 2025

Describe the bug
I have tried several models and many of them immediately crash, reporting
<nobodywho::NobodyWhoChat as godot_core::gen::classes::node::re_export::INode>::physics_process: Model worker crashed: Lama.cpp failed fetching chat template: the model has no meta val - returned code -1
<C++ Source> src\lib.rs:152 @ <nobodywho::NobodyWhoChat as godot_core::gen::classes::node::re_export::INode>::physics_process()

To Reproduce
Steps to reproduce the behavior:

  1. Create the basic chat example.
  2. Use this model (or a number of other models, from what I can tell. Unfortunately, I don't really understand this stuff deeply enough to know the pattern of what works and what doesn't; I'm just trying to experiment with LLMs in Godot using models that were recommended to me)
  3. Attempt to run the scene
  4. Model crash occurs almost immediately

Or preferably a Godot .tscn file with built-in scripts (this can be achieved by checking the "built-in" checkbox when adding a script to a node)
Sure thing:

[gd_scene load_steps=2 format=3 uid="uid://dp1nwfabthckb"]

[sub_resource type="GDScript" id="GDScript_vtuoc"]
resource_name = "test.gd"
script/source = "extends NobodyWhoChat


# Called when the node enters the scene tree for the first time.
func _ready():
	# configure node
	model_node = get_node(\"NobodyWhoModel\")
	system_prompt = \"You are an evil wizard. Always try to curse anyone who talks to you.\"

	# say something
	say(\"Hi there! Who are you?\")

	# wait for the response
	var response = await response_finished
	print(\"Got response: \" + response)


func _on_response_updated(new_token):
	print(new_token)
"

[node name="Test" type="NobodyWhoChat" node_paths=PackedStringArray("model_node")]
model_node = NodePath("NobodyWhoModel")
script = SubResource("GDScript_vtuoc")

[node name="NobodyWhoModel" type="NobodyWhoModel" parent="."]
model_path = "res://models/Fimbulvetr-11B-v2-Q4_K_M-imat.gguf"

[connection signal="response_updated" from="." to="." method="_on_response_updated"]

Expected behavior
The model should run and return a response without crashing, just like it does when using the recommended Gemma 2B model.

Environment:

  • OS: Windows
  • Godot Version: 4.3
  • NobodyWho Version: Whatever is currently on the AssetLib
  • LLM Model: Total crapshoot. Every Llama 2 based gguf I have tried has failed, and the same goes for most Mistral ggufs. I did get lucky with one Llama 3 based gguf, one particularly large Mistral gguf, and the suggested Gemma 2B gguf. One Llama 2 model even crashed so hard that it ended the debug session, so I don't even know what the error was.

Additional context
I know basically all of the GGUFs I used should work (if it even makes sense for one to "not work" at all), as I have seen them functioning very well on other backends on this very same machine.

@docfail docfail added the bug Something isn't working label Jan 30, 2025
@docfail docfail changed the title [BUG] Llama based ggufs seem to not work. [BUG] Llama2 and some Mistral based ggufs seem to not work. Jan 30, 2025
@AsbjornOlling
Contributor

The reason these GGUFs don't work is because their metadata doesn't include a chat template.

I think that this is a relatively new thing for GGUF, so models that are old-ish (like the ~1yr old model you linked) generally won't work.

Other backends (e.g. the llama.cpp server and Ollama) don't actually implement the chat templates properly; instead, they try to detect which model is being used and then fall back to their own template for each supported model. These templates are not exactly the same as the ones intended for the model, just something close enough. I strongly suspect that this results in sub-par responses compared to using the proper upstream templates.
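For reference, you can check whether a given GGUF ships a chat template by reading its metadata with llama.cpp's gguf-py package. A minimal sketch, assuming the template sits under the standard tokenizer.chat_template key (the exact field decoding may differ between gguf-py versions):

# Minimal sketch: report whether a .gguf file carries a chat template.
# Assumes `pip install gguf` (llama.cpp's gguf-py) and the standard
# `tokenizer.chat_template` metadata key.
import sys
from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])
field = reader.fields.get("tokenizer.chat_template")
if field is None:
    print("No chat template in the GGUF metadata -- this model won't load in NobodyWho.")
else:
    # For a plain string field the last part holds the raw UTF-8 bytes;
    # the exact part layout can vary between gguf-py versions.
    print(bytes(field.parts[-1]).decode("utf-8"))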

We did discuss falling back to some kind of default template, in cases where the provided GGUF file doesn't provide one. I'm leaning towards this being a bad idea. Silently applying a somewhat-random chat template to a bunch of models could result in them outputting subpar responses, without the user noticing.
I guess it might be okay to do this, and then also emit a warning, so the user knows that it's gonna act a bit weird.

A relevant question is why you're interested in using "old" models like llama2. My impression is that the llama3 models are strictly better in all regards.

@AsbjornOlling
Contributor

Maybe the error message could be improved? We could add a message like:

"It looks like your GGUF model doesn't include a chat template. Could it be that you are using an older model? Try using a model newer than , or manually add a chat template to the GGUF metadata using gguf-py."

@docfail
Author

docfail commented Jan 30, 2025

These were all models that came pretty heavily recommended for roleplay, which is more or less the exact desired behavior of "pretend to be an NPC". They also seem to perform pretty well in general. I'm sure the newer models probably all work better, as you've mentioned, but being new means they don't come as well recommended, because they're not as "tried and tested".

The one Llama 3 model I have performs incredibly sluggishly (near unusably so) with this extension for some reason, despite being pretty speedy using a different local chat client running the same file; not sure what the deal is there?

I would agree that the error message should be improved if you're not intending to support older models in any form. That way users of the extension are aware that this isn't a bug but rather intended behavior, and that they'll need to find newer models. (Maybe providing guidance on where to find "new" models, or how to tell which ones are valid, would also be helpful here, since these are large files and it'd be a pain to find and download one only to discover it doesn't work.)

If possible, it would be good if the extension could pre-scan the model to see if it IS supported at the time it is selected, and display something in the editor instead of at runtime. I'm not sure what the format of these files is, but if the information is early in the header this shouldn't be too difficult to do.
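For what it's worth, the published GGUF layout does put the key/value metadata right at the front of the file (magic, version, tensor count, then the key/value section), before any tensor data, so a pre-scan like that should be cheap. A rough sketch of such a check, written against the GGUF v2/v3 layout rather than NobodyWho's actual code:

# Rough sketch of a cheap "does this GGUF declare a chat template?" pre-scan.
# Based on the published GGUF v2/v3 layout (magic, version, tensor count,
# then key/value metadata), not on NobodyWho's actual code.
import struct
import sys

# Fixed-size metadata value types (bytes); STRING (8) and ARRAY (9) are variable.
FIXED_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
STRING, ARRAY = 8, 9

def read_string(f):
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")

def skip_value(f, vtype):
    if vtype in FIXED_SIZES:
        f.seek(FIXED_SIZES[vtype], 1)
    elif vtype == STRING:
        (length,) = struct.unpack("<Q", f.read(8))
        f.seek(length, 1)
    elif vtype == ARRAY:
        elem_type, count = struct.unpack("<IQ", f.read(12))
        for _ in range(count):
            skip_value(f, elem_type)

with open(sys.argv[1], "rb") as f:
    assert f.read(4) == b"GGUF", "not a GGUF file"
    version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    found = False
    for _ in range(n_kv):
        key = read_string(f)
        (vtype,) = struct.unpack("<I", f.read(4))
        if key == "tokenizer.chat_template":
            found = True
            break
        skip_value(f, vtype)
    print("chat template present" if found else "no chat template in metadata")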

@AsbjornOlling
Contributor

The one Llama 3 model I have performs incredibly sluggishly (near unusably so) with this extension for some reason, despite being pretty speedy using a different local chat client running the same file; not sure what the deal is there?

I'm interested in this, but it doesn't seem related to this bug report.

Maybe open a new one, or hop in the group chat if you would prefer to talk about it more informally. I would like to know more about this.
Please also let me know what hardware you're running on, what the parameter count and quantization level of the model are, and what "a different local chat client" is.

@AsbjornOlling
Contributor

Closing this issue because the library works as intended.

We don't support models that don't include a chat template, and the error message when loading the model clearly states that it fails because it fails to fetch a chat template.

If you continue to believe that this is a bug, and that more should be done to address it, feel free to re-open this issue with a suggestion on what should be done.

@docfail
Author

docfail commented Feb 6, 2025

Let me preface this by saying that I don't have permission to reopen this issue, so I may need to open a new issue for this if there's no response.

Closing this issue because the library works as intended.

We don't support models that don't include a chat template, and the error message when loading the model clearly states that it fails because it fails to fetch a chat template.

If you continue to believe that this is a bug, and that more should be done to address it, feel free to re-open this issue with a suggestion on what should be done.

I have no issue with not supporting models that don't include a chat template, but I disagree entirely with the assertion that the error message "clearly states that it fails because it fails to fetch a chat template", and even more so with the closing of the issue.
Firstly, even if it were clearly stated that that is the issue, there is no clear direction for the end user as to what that means or what their next actions should be. There is no information to let them know whether this is a model incompatibility or whether the system failed for some inscrutable (to them) reason.
But beyond that, "Lama.cpp failed fetching chat template: the model has no meta val - returned code -1" does not clearly state that it failed because it failed to fetch a chat template. It states that it failed because "the model has no meta val - returned code -1". The "failed fetching chat template" being immediately followed by contextualizing information implies that there's a deeper issue; error messages aren't traditionally formatted to be parsed otherwise.
That all being said, I'd like to emphasize that it's more or less entirely irrelevant if the error message conveys that the chat template failed to fetch to the end user.

The core of the issue here is that if users are to receive an error message that isn't the result of a bug, the message needs to be actionable within a reasonable expectation of user understanding. The fact that models that don't have chat templates are not supported is not clearly stated. Beyond that, even if it were clearly stated: it's not clear how to even tell if a model has a chat template. From a quick search, it looks like you'd need dedicated tools to inspect it to find out?

If the issue is that there is no chat template present, then simply and clearly state in the error message: "This model does not contain a chat template; models without chat templates are not supported." That information cannot be inferred from the error message in its current form without a lot of unsafe assumptions on the user's part. This won't make it any less frustrating to then need to go digging for a model that does have a chat template, without knowing which ones do and don't; however, at least that is now googleable in some way, shape, or form.

TL;DR: Change the error message. In its current state, it's completely inscrutable without knowledge of either the inner workings of the extension, the structure of the model, or both.

@docfail
Author

docfail commented Feb 6, 2025

The one Llama 3 model I have performs incredibly sluggishly (near unusably so) with this extension for some reason, despite being pretty speedy using a different local chat client running the same file; not sure what the deal is there?

I'm interested in this, but it doesn't seem related to this bug report.

Maybe open a new one, or hop in the group chat if you would prefer to talk about it more informally. I would like to know more about this. Please also let me know what hardware you're running on, what the parameter count and quantization level of the model are, and what "a different local chat client" is.

I'm more than happy to provide any information I can. I'm not sure what you mean by "the group chat", or I would gladly hop in and explain. For the time being though I'll at least answer those questions provided.

What hardware:

RTX 4080, 64 GB RAM, 14900K

what the parameter count and quantization levels of the model are:

I don't know how to answer these questions at present, or at least not without just guessing or assuming. I can look into it. However, in the interest of answering the questions while they're still hot, I have already provided the model in a different issue, so I'll just give you the same link: llama3.8b.hathor_fractionate-l3-v.05.gguf_v2.q8_0.gguf

different local chat client:

Backyard.ai - formerly known as Faraday from what I understand

@AsbjornOlling
Contributor

Let me preface this by saying that I don't have permission to reopen this issue, so I may need to open a new issue for this if there's no response.

Ah damn. I guess I didn't realize how permissions work on github issues.

I have no issue with not supporting models that don't include a chat template, but I disagree entirely with the assertion that the error message "clearly states that it fails because it fails to fetch a chat template"

In its current state, it's completely inscrutable without knowledge of either the inner workings of the extension, the structure of the model, or both.

That's totally fair. I guess this is a "curse of knowledge" thing, where it only becomes clear if one has spent most of the past months staring at llama.cpp errors and taking apart GGUF files 😅

The point of NobodyWho is precisely to let people use local LLMs without understanding the internals of llama.cpp.

Let's improve the error message.

@AsbjornOlling
Contributor

Backyard.ai - formerly known as Faraday from what I understand

Hm, it's proprietary and doesn't run on Linux, so it's a bit more difficult for me to examine closely. I wonder if they're using CUDA on NVIDIA machines. I expect CUDA to perform somewhat better than what we're currently using, so that could be it.

I'm not sure what you mean by "the group chat", or I would gladly hop in and explain.

The "group chat" I'm referring to is the matrix or discord chat that we link in the README. Feel free to use that if you want to, but github issues is also totally fine.

@AsbjornOlling
Contributor

Also I wrote a much more detailed error message for failing to fetch chat templates. I hope you agree that this is more actionable. Let me know if you disagree.

@docfail
Author

docfail commented Feb 6, 2025

Also I wrote a much more detailed error message for failing to fetch chat templates. I hope you agree that this is more actionable. Let me know if you disagree.

Just took a look; it's leagues better! It explains the issue and its likely cause, and even provides actionable next steps for an end user. Checks all the boxes. Thank you very much!

@docfail
Author

docfail commented Feb 6, 2025

I'm not sure what you mean by "the group chat", or I would gladly hop in and explain.

The "group chat" I'm referring to is the matrix or discord chat that we link in the README. Feel free to use that if you want to, but github issues is also totally fine.

Ah, excellent, that makes sense. I'll probably pop into one of those chats then.
