Add Olmo3 implementation #16015
Conversation
I used the model conversion example for testing. I got the following results with bf16 on shanearora/2025-sep-a-base-model, modified to have YaRN rope scaling enabled.
Below that are the results for allenai/OLMo-2-0425-1B with fp32.
src/llama-model.cpp (Outdated)
```cpp
if (is_swa) {
    // For sliding window layers, Olmo3 does not use rope scaling.
    // This is achieved here by setting freq_scale and attn_factor to 1.
    // We also set ext_factor to 0 to avoid a few unnecessary computations.
    Qcur = ggml_rope_ext(
        ctx0, Qcur, inp_pos, nullptr,
        n_rot, rope_type, n_ctx_orig, freq_base, 1.0,
        0.0, 1.0, beta_fast, beta_slow
    );

    Kcur = ggml_rope_ext(
        ctx0, Kcur, inp_pos, nullptr,
        n_rot, rope_type, n_ctx_orig, freq_base, 1.0,
        0.0, 1.0, beta_fast, beta_slow
    );
}
else {
```
Suggested change:

```diff
-if (is_swa) {
-    // For sliding window layers, Olmo3 does not use rope scaling.
-    // This is achieved here by setting freq_scale and attn_factor to 1.
-    // We also set ext_factor to 0 to avoid a few unnecessary computations.
-    Qcur = ggml_rope_ext(
-        ctx0, Qcur, inp_pos, nullptr,
-        n_rot, rope_type, n_ctx_orig, freq_base, 1.0,
-        0.0, 1.0, beta_fast, beta_slow
-    );
-    Kcur = ggml_rope_ext(
-        ctx0, Kcur, inp_pos, nullptr,
-        n_rot, rope_type, n_ctx_orig, freq_base, 1.0,
-        0.0, 1.0, beta_fast, beta_slow
-    );
-}
-else {
+if (!is_swa) {
```
This if block is needed: for SWA layers Olmo3 still uses standard (unscaled) rope, so removing the if block would remove rope on SWA layers entirely.
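For illustration, here is an equivalent way to express that branch — a minimal sketch, not the PR's code, assuming the same locals as the hunk above plus a per-layer `is_swa` flag and the scaled-rope parameters (`freq_scale`, `ext_factor`, `attn_factor`) used on full-attention layers. It makes explicit that SWA layers still call `ggml_rope_ext`, just with the scaling neutralized:

```cpp
// Sketch: neutralize rope scaling on sliding-window layers instead of skipping
// rope there. freq_scale = 1 disables linear scaling, ext_factor = 0 disables
// YaRN, attn_factor = 1 leaves the attention scale untouched.
const float rope_freq_scale  = is_swa ? 1.0f : freq_scale;
const float rope_ext_factor  = is_swa ? 0.0f : ext_factor;
const float rope_attn_factor = is_swa ? 1.0f : attn_factor;

Qcur = ggml_rope_ext(
        ctx0, Qcur, inp_pos, nullptr,
        n_rot, rope_type, n_ctx_orig, freq_base, rope_freq_scale,
        rope_ext_factor, rope_attn_factor, beta_fast, beta_slow);

Kcur = ggml_rope_ext(
        ctx0, Kcur, inp_pos, nullptr,
        n_rot, rope_type, n_ctx_orig, freq_base, rope_freq_scale,
        rope_ext_factor, rope_attn_factor, beta_fast, beta_slow);
```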
6997fad Clarified comment slightly
Co-authored-by: Sigbjørn Skjæret <[email protected]>
This PR adds support for the upcoming Olmo 3. The main architectural differences from Olmo 2 are:

- sliding window attention on a subset of layers;
- YaRN rope scaling, applied only to the full-attention (non-SWA) layers.
Since the architecture is very similar to Olmo 2, this PR opts to merge Olmo 3 changes into the Olmo 2 implementation (similar to vllm-project/vllm#24534). I can create a separate Olmo 3 implementation instead if preferred.
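For context, a rough sketch of the GGUF metadata involved, assuming the converted Olmo 3 checkpoints are written under the existing `olmo2` architecture keys (the key names are the standard llama.cpp rope-scaling and sliding-window keys; the values are placeholders, not the model's real configuration):

```
olmo2.rope.scaling.type                    = "yarn"
olmo2.rope.scaling.factor                  = <model-specific>
olmo2.rope.scaling.original_context_length = <model-specific>
olmo2.attention.sliding_window             = <model-specific>
```

Olmo 2 models carry none of this metadata, so in the merged implementation they simply never take the SWA path and get plain rope on every layer.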