
Conversation

@2015aroras (Contributor) commented Sep 15, 2025

This PR adds the upcoming Olmo 3. The main architectural differences from Olmo 2 are:

  • Sliding window attention is used for 3 out of 4 layers. RoPE scaling is not applied to sliding window attention layers.

Since the architecture is very similar to Olmo 2, this PR opts to merge Olmo 3 changes into the Olmo 2 implementation (similar to vllm-project/vllm#24534). I can create a separate Olmo 3 implementation instead if preferred.
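
To make the sliding window pattern concrete, here is a minimal sketch of the per-layer decision (my own illustration, not code from this PR; the helper name, the 0-based indexing, and the assumption that every 4th layer is the full-attention one are all hypothetical):

```cpp
// Sketch only: a repeating 4-layer period in which the first three layers use
// sliding window attention and the fourth uses full attention. The real
// pattern/index convention in the model config may differ.
static bool olmo3_layer_is_swa(int il) {
    return (il % 4) < 3; // layers 0,1,2 -> SWA; layer 3 -> full attention; repeat
}

// RoPE scaling (e.g. YaRN) would then be applied only on full-attention layers:
//   float freq_scale_l = olmo3_layer_is_swa(il) ? 1.0f : freq_scale;
```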

@github-actions github-actions bot added the python python script changes label Sep 15, 2025
@2015aroras 2015aroras marked this pull request as ready for review September 15, 2025 20:08
@2015aroras (Contributor Author) commented Sep 15, 2025

I used the model conversion example for testing. I got the following results using bf16 on shanearora/2025-sep-a-base-model, modified to have YaRN RoPE scaling enabled.

📈 METRICS
==============================
MSE (Mean Squared Error):     1.592396e-02
Reference Variance:           6.831117e+00
NMSE:                         2.331092e-03
Max Absolute Error:           0.438750
Mean Absolute Error:          0.116665
NMSE (dB):                    -26.32 dB

🎯 INTERPRETATION
==============================
👍 Good match

📋 GUIDANCE
==============================
👍 GOOD: Your GGML conversion is working well.
   Small differences are likely due to precision/quantization.

📚 NMSE BENCHMARKS
==============================
✅ RESULT: PASS (NMSE = 2.33e-03)

Also, below are the results for allenai/OLMo-2-0425-1B with fp32.

📈 METRICS
==============================
MSE (Mean Squared Error):     1.594746e-03
Reference Variance:           9.219801e+00
NMSE:                         1.729697e-04
Max Absolute Error:           0.168732
Mean Absolute Error:          0.033951
NMSE (dB):                    -37.62 dB

🎯 INTERPRETATION
==============================
👍 Very good match

📋 GUIDANCE
==============================
✅ EXCELLENT: Your GGML conversion is working very well!
   The differences are negligible for practical use.

📚 NMSE BENCHMARKS
==============================
✅ RESULT: PASS (NMSE = 1.73e-04)
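
For reference, the NMSE reported by the example is the MSE divided by the variance of the reference outputs, and the dB figure is 10·log10(NMSE); that matches the numbers above (1.592396e-02 / 6.831117e+00 ≈ 2.33e-03, i.e. about -26.3 dB). A minimal sketch of that computation (hypothetical helper, not the conversion example's actual code):

```cpp
#include <cmath>
#include <vector>

// Sketch: NMSE = MSE(ref, test) / Var(ref); NMSE_dB = 10 * log10(NMSE).
static double nmse(const std::vector<double> & ref, const std::vector<double> & test) {
    double mse = 0.0, mean = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = ref[i] - test[i];
        mse  += d * d;
        mean += ref[i];
    }
    mse  /= ref.size();
    mean /= ref.size();

    double var = 0.0;
    for (const double r : ref) {
        var += (r - mean) * (r - mean);
    }
    var /= ref.size();

    return mse / var; // in dB: 10.0 * std::log10(mse / var)
}
```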

Comment on lines 12218 to 12234
```cpp
if (is_swa) {
    // For sliding window layers, Olmo3 does not use rope scaling.
    // This is achieved here by setting freq_scale and attn_factor to 1.
    // We also set ext_factor to 0 to avoid a few unnecessary computations.
    Qcur = ggml_rope_ext(
            ctx0, Qcur, inp_pos, nullptr,
            n_rot, rope_type, n_ctx_orig, freq_base, 1.0,
            0.0, 1.0, beta_fast, beta_slow
            );

    Kcur = ggml_rope_ext(
            ctx0, Kcur, inp_pos, nullptr,
            n_rot, rope_type, n_ctx_orig, freq_base, 1.0,
            0.0, 1.0, beta_fast, beta_slow
            );
}
else {
```
Collaborator commented:
Suggested change

```diff
-if (is_swa) {
-    // For sliding window layers, Olmo3 does not use rope scaling.
-    // This is achieved here by setting freq_scale and attn_factor to 1.
-    // We also set ext_factor to 0 to avoid a few unnecessary computations.
-    Qcur = ggml_rope_ext(
-            ctx0, Qcur, inp_pos, nullptr,
-            n_rot, rope_type, n_ctx_orig, freq_base, 1.0,
-            0.0, 1.0, beta_fast, beta_slow
-            );
-    Kcur = ggml_rope_ext(
-            ctx0, Kcur, inp_pos, nullptr,
-            n_rot, rope_type, n_ctx_orig, freq_base, 1.0,
-            0.0, 1.0, beta_fast, beta_slow
-            );
-}
-else {
+if (!is_swa) {
```

@2015aroras (Contributor Author) replied:
This if block is needed. For SWA layers, Olmo2 uses standard RoPE. Removing that if block would remove RoPE on SWA layers entirely.
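
For readers following this exchange, the positional arguments in those ggml_rope_ext calls map onto ggml's rope parameters roughly as below; the annotation is my reading of the API (not part of the PR), but it shows why the hard-coded values disable scaling while still applying standard RoPE:

```cpp
Qcur = ggml_rope_ext(
    ctx0, Qcur, inp_pos, nullptr,
    n_rot, rope_type, n_ctx_orig, freq_base,
    /*freq_scale =*/ 1.0,   // no frequency scaling on SWA layers
    /*ext_factor =*/ 0.0,   // skip the YaRN extrapolation mixing
    /*attn_factor=*/ 1.0,   // no attention magnitude correction
    beta_fast, beta_slow);
```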

@2015aroras (Contributor Author) commented:
6997fad Clarified comment slightly

2015aroras and others added 2 commits September 15, 2025 15:13