
Fix typo in MoE note. #27

Merged: 2 commits into jax-ml:main on Feb 7, 2025
Conversation

fedelebron (Collaborator):

Also TeX'd some variables.

transformers.md (Outdated)
@@ -251,9 +251,9 @@ So the takeaway is that **dot-product attention FLOPs only become dominant durin

### Sparsity and Mixture-of-Experts

- We'd be remiss not to briefly discuss Mixture of Experts (MoE) models<d-cite key="moe"></d-cite>, which replace the single dense MLP blocks in a standard Transformer with a set of independent MLPs that can be dynamically routed between. To a first approximation, **an MoE is a dense model with E MLP blocks per layer**, instead of just one. Each token activates k of these experts, typically k=2. This increases the parameter count by O(E), while keeping the total number of activated parameters roughly the same as the dense model.
+ We'd be remiss not to briefly discuss Mixture of Experts (MoE) models<d-cite key="moe"></d-cite>, which replace the single dense MLP blocks in a standard Transformer with a set of independent MLPs that can be dynamically routed between. To a first approximation, **an MoE is a dense model with E MLP blocks per layer**, instead of just one. Each token activates $k$ of these experts, typically $k=2$. This increases the parameter count by $O(E)$, while multipling the total number of activated parameters per inference by $k$, compared with the dense version.
Collaborator:

multiplying

Collaborator:

also "activated parameters per token"

fedelebron merged commit 9b1f5ea into jax-ml:main on Feb 7, 2025.
1 check passed.
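
For concreteness, here is a minimal back-of-the-envelope sketch of the relationship described in the edited sentence, with the review corrections applied ("multiplying", "per token"): an MoE layer holds roughly $E$ times the MLP parameters of the dense model it replaces, but only activates $k$ experts' worth of parameters per token. The widths `d_model` and `d_ff`, the expert count `E`, and the top-$k$ value below are illustrative assumptions, not numbers from transformers.md.

```python
# Rough parameter-count sketch (illustrative assumptions, not values from the note).
d_model, d_ff = 4096, 16384   # hidden width and MLP width (assumed)
E, k = 8, 2                   # number of experts; experts activated per token (assumed)

dense_mlp_params = 2 * d_model * d_ff         # W_in and W_out of one dense MLP block
moe_total_params = E * dense_mlp_params       # total parameter count grows by O(E)
moe_activated_params = k * dense_mlp_params   # activated parameters per token grow by k

print(f"dense MLP block:           {dense_mlp_params:>14,}")
print(f"MoE block, total:          {moe_total_params:>14,}")
print(f"MoE block, active / token: {moe_activated_params:>14,}")
```

Under these assumed values, the MoE block stores 8x the parameters of the dense block while activating only 2x as many per token, which is the first-approximation tradeoff the note describes.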