

Overview of Mixture of Experts (MoE) in Transformer-based Models
AIO
Table of contents
1. Introduction to Mixture of Experts (MoE)
2. Basic taxonomy on MoEs
3. MoE Mechanisms
4. Discussions
Introduction to Mixture of Experts (MoE)
What is Mixture of Experts?
“Mixture of experts (MoE) is a machine learning approach that divides an artificial intelligence
(AI) model into separate sub-networks (or “experts”), each specializing in a subset of the
input data, to jointly perform a task.”
- from IBM
A system of expert and gating networks. (Left and right figures retrieved from [1], [2] respectively)
Introduction to Mixture of Experts (MoE)
What does an MoE consist of?
MoE consists of two main elements:
1. Sparse MoE layers
These layers are used in place of the dense feed-forward network (FFN) layers.
Comparing a dense and a sparse expert Transformer. Figure retrieved from [6].
MoE layers contain a certain number of “experts” (e.g., 8), where each expert is itself a neural network.
In practice, the experts are FFNs, but they can also be more complex networks, or even an MoE itself, leading to hierarchical MoEs!
The Switch Transformer encoder block replaces the dense FFN layer with a sparse Switch FFN layer (light blue). Figure retrieved from [3].
Introduction to Mixture of Experts (MoE)
What does an MoE consist of?
MoE consists of two main elements:
2. A gate network or router
How is an expert chosen? A gating network determines the weight assigned to each expert.
E.g., this component determines which tokens are sent to which expert.
How to route a token to an expert is one of the big decisions when working with MoEs.
→ The router is composed of learned parameters and is pretrained at the same time as the rest of the network.
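To make the two elements concrete, here is a minimal PyTorch sketch of a sparse MoE layer: a list of FFN experts plus a learned linear gate that routes each token to its top-k experts. The class name, layer sizes, and the top-k choice are illustrative assumptions, not code from the cited papers; real implementations batch tokens per expert instead of looping over them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Sketch of a sparse MoE layer that stands in for a dense FFN block."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        # Element 1: the experts. Each expert is an ordinary FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # Element 2: the gate/router, trained together with the rest of the network.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, x):                                  # x: (num_tokens, d_model)
        weights = F.softmax(self.gate(x), dim=-1)          # (num_tokens, num_experts)
        topk_w, topk_idx = weights.topk(self.k, dim=-1)    # keep only k experts per token
        y = torch.zeros_like(x)
        for t in range(x.shape[0]):
            for w, idx in zip(topk_w[t], topk_idx[t]):
                # Only the selected experts are evaluated for this token.
                y[t] += w * self.experts[int(idx)](x[t])
        return y
```

(Noisy Top-k Gating, discussed later, instead applies the softmax after the top-k selection, so that the kept weights sum to one.)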
Basic taxonomy on MoEs
The roots of MoEs come from the 1991 paper Adaptive Mixtures of Local Experts by Jacobs et al. [1]. The idea, akin to ensemble methods, was to have a supervised procedure for a system composed of separate networks, each handling a different subset of the training cases.
Two different research areas contributed to MoE
advancement:
1. Experts as components
In the traditional MoE setup, the whole system comprises a gating network and multiple experts. Later work, such as the Deep Mixture of Experts [4], used MoEs as components (layers) within a deeper network, making it possible for the model to be both large and efficient at the same time.
Deep Mixture of Experts [4]
2. Conditional computation:
Traditional networks process all input data through every layer. Around the same period, Yoshua Bengio and collaborators researched approaches to dynamically activate or deactivate parts of the network based on the input.
MoE Mechanisms
What is sparsity?
Sparsity uses the idea of conditional computation. While in dense models all of the parameters are used for every input, sparsity allows us to run only some parts of the whole system.
Leveraging the idea of conditional computation:
→ allows scaling the size of the model without increasing the computation;
→ allows thousands of experts to be used in each MoE layer.
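As a rough, illustrative calculation (the expert count and routing choice below are assumptions, not taken from the cited papers): with N = 8 experts of P parameters each and top-k routing with k = 2,

\[
\underbrace{N \cdot P = 8P}_{\text{parameters stored}}
\qquad \text{vs.} \qquad
\underbrace{k \cdot P = 2P}_{\text{parameters used per token}}
\]

so the model's capacity grows with the number of experts, while the per-token computation only grows with k.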
MoE Mechanisms
A learned gating network (G) decides which experts (E) the input (or a part of it) is sent to.
What if G equals 0?
→ Then there is no need to compute the corresponding expert, which saves computational cost.
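For reference, the standard sparsely-gated formulation from Shazeer et al. [5] writes the layer output as a gate-weighted sum over the n experts; whenever G(x)_i = 0, the corresponding expert E_i need not be evaluated:

\[
y = \sum_{i=1}^{n} G(x)_i \, E_i(x)
\]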
So, how do we define the gating function?
MoE Mechanisms
Given an input x, the learned gating network (G) decides which experts (E) to send the input to.
In the most traditional setup, we simply use a small network with a softmax function; the network learns which expert to send the input to.
The learnable gating parameters are what choose the expert.
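Concretely, the softmax gating of [5] is, with W_g denoting the learnable gating parameters:

\[
G_\sigma(x) = \mathrm{Softmax}(x \cdot W_g)
\]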
MoE Mechanisms
Another gating approach, proposed by Shazeer et al. [5], is Noisy Top-k Gating. It introduces some (tunable) noise and then keeps only the top k values:
Step 1: Add noise to the gating logits: H(x)_i = (x · W_g)_i + StandardNormal() · Softplus((x · W_noise)_i)
Step 2: Keep only the top k values: KeepTopK(v, k)_i = v_i if v_i is among the top k elements of v, and −∞ otherwise
Step 3: Apply the softmax: G(x) = Softmax(KeepTopK(H(x), k))
By using a low enough k (e.g., one or two), we can train and run inference much faster than if many experts were activated.
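A minimal PyTorch sketch of these three steps, assuming per-token routing; the function name, argument shapes, and the train flag are my own illustrative choices, not the authors' code:

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, w_gate, w_noise, k=2, train=True):
    """Noisy Top-k Gating in the style of Shazeer et al. [5].

    x: (num_tokens, d_model); w_gate, w_noise: (d_model, num_experts).
    Returns gate values G(x) of shape (num_tokens, num_experts), sparse per row.
    """
    clean_logits = x @ w_gate
    if train:
        # Step 1: add input-dependent, tunable Gaussian noise.
        noise_std = F.softplus(x @ w_noise)
        logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    else:
        logits = clean_logits
    # Step 2: keep only the top-k logits per token; set the rest to -inf.
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)
    # Step 3: softmax over the masked logits -> zero weight for unselected experts.
    return F.softmax(masked, dim=-1)
```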
→ Are there other routing strategies? Yes, several exist.
MoE Mechanisms
Visualization of six different routing algorithms. Figure retrieved from [6].
MoE Mechanisms
Load balancing problems in MoEs
Batch size matters for the effectiveness of MoEs, because each expert only sees the tokens that are routed to it.
For example, if our batched input consists of 10 tokens:
- 5 tokens might end up in one expert, and
- the other 5 tokens might end up in five different experts,
→ leading to uneven batch sizes and underutilization.
→ The gating network converges to mostly activating the same few experts.
→ This self-reinforces, as the favored experts are trained more quickly and hence selected more often.
→ HOW TO RESOLVE THIS?
MoE Mechanisms
Load balancing problems in MoEs
1. Recall Noisy Top-k Gating: the added noise encourages exploration across experts.
2. An auxiliary loss is added to encourage giving all experts equal importance. This loss ensures that all experts receive a roughly equal number of training examples (see the sketch below).
3. Expert capacity, which introduces a threshold on how many tokens can be processed by each expert.
For example, Switch Transformers [3] also explore the concept of expert capacity.
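A minimal sketch of the auxiliary load-balancing loss, following the formulation popularized by Switch Transformers [3] (alpha · N · Σ_i f_i · P_i); the argument names, the top-1 routing assumption, and the default alpha are illustrative:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_indices, num_experts, alpha=0.01):
    """Auxiliary loss that pushes tokens to be spread evenly across experts.

    router_probs:   (num_tokens, num_experts) softmax outputs of the router.
    expert_indices: (num_tokens,) expert chosen for each token (top-1 routing).
    """
    # f_i: fraction of tokens dispatched to each expert.
    dispatch = F.one_hot(expert_indices, num_experts).float()
    tokens_per_expert = dispatch.mean(dim=0)
    # P_i: mean router probability mass assigned to each expert.
    mean_router_prob = router_probs.mean(dim=0)
    # Minimized when both f and P are uniform (1/N per expert).
    return alpha * num_experts * torch.sum(tokens_per_expert * mean_router_prob)
```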
Discussions
When to use sparse MoEs vs dense models?
Experts are useful for high-throughput scenarios with many machines. Given a fixed compute budget for pretraining, a sparse model will be the better choice. For low-throughput scenarios with little VRAM, a dense model will be better.
Interpretability
Sparse expert models more naturally lend themselves to interpretability studies
because each input is processed by an identifiable, discrete subset of the model
weights (i.e., the chosen experts). Therefore, instead of the daunting task of interpreting possibly trillions of floating-point numbers, one can instead read off a small, discrete set of integers corresponding to which experts the input was sent to.
Discussions
Interpretability
Expert Interpretability. Figure retrieved from [6].
References
[1] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts.
Neural computation, 3(1), 79-87.
[2] Zian Wang (2023), Mixture of Experts: How an Ensemble of AI Models Decide As One,
https://deepgram.com/learn/mixture-of-experts-ml-model-guide
[3] Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: Scaling to trillion parameter models
with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1-39.
[4] Eigen, D., Ranzato, M. A., & Sutskever, I. (2013). Learning factored representations in a deep
mixture of experts. arXiv preprint arXiv:1312.4314.
[5] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously
large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
[6] Fedus, W., Dean, J., & Zoph, B. (2022). A review of sparse expert models in deep learning. arXiv
preprint arXiv:2209.01667.