Overview of Mixture of Experts (MoE) in Transformer-based Models (AIO)

Table of contents
1. Introduction to Mixture of Experts (MoE)
2. Basic taxonomy on MoEs
3. MoE Mechanisms
4. Discussions

Introduction to Mixture of Experts (MoE)

What is Mixture of Experts?
“Mixture of experts (MoE) is a machine learning approach that divides an artificial intelligence (AI) model into separate sub-networks (or “experts”), each specializing in a subset of the input data, to jointly perform a task.” - from IBM
(Figure: A system of expert and gating networks. Left and right figures retrieved from [1] and [2], respectively.)

What does MoE have?
MoE consists of two main elements:

1. Sparse MoE layers
These layers are an alternative to feed-forward network (FFN) layers. An MoE layer has a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks, or even an MoE itself, leading to hierarchical MoEs!
(Figure: Comparing a dense and a sparse expert Transformer. Retrieved from [6].)
(Figure: The Switch Transformer encoder block replaces the dense FFN layer with a sparse Switch FFN layer (light blue). Retrieved from [3].)

2. A gate network or router
How is the expert chosen? A gating network determines the weights for each expert, i.e. it determines which tokens are sent to which expert. How to route a token to an expert is one of the big decisions when working with MoEs.
→ The router is composed of learned parameters and is pretrained at the same time as the rest of the network (see the sketch below).
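To make these two elements concrete, here is a minimal PyTorch sketch of a sparse MoE layer with top-1 routing. It is an illustration only, not the implementation from any of the cited papers; the class name SimpleMoELayer and the sizes d_model, d_ff, and num_experts are made up for this example.

```python
# Minimal sketch of a sparse MoE layer: a learned router picks the top-1
# expert FFN per token and scales that expert's output by the gate probability.
# Names and dimensions are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, d_model=16, d_ff=32, num_experts=8):
        super().__init__()
        # Each expert is an ordinary FFN, as in a dense Transformer block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The router is a learned linear layer trained jointly with the experts.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                            # x: (num_tokens, d_model)
        gate_logits = self.router(x)                 # (num_tokens, num_experts)
        gate_probs = F.softmax(gate_logits, dim=-1)
        top_prob, top_idx = gate_probs.max(dim=-1)   # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                      # tokens routed to expert i
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 16)                         # a batch of 10 token embeddings
print(SimpleMoELayer()(tokens).shape)                # torch.Size([10, 16])
```

Only the expert selected for each token is evaluated, which is where the computational savings discussed in the mechanisms section come from.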
Basic taxonomy on MoEs
The roots of MoEs come from the paper Adaptive Mixtures of Local Experts by Jacobs et al. (1991) [1]. The idea, akin to ensemble methods, was to have a supervised procedure for a system composed of separate networks, each handling a different subset of the training cases.
Two different research areas contributed to MoE advancement:
1. Experts as components: In the traditional MoE setup, the whole system comprises a gating network and multiple experts. This allows having MoEs as layers in a multilayer network, making it possible for the model to be both large and efficient at the same time (Deep Mixture of Experts [4]).
2. Conditional computation: Traditional networks process all input data through every layer. In this period, Yoshua Bengio researched approaches to dynamically activate or deactivate components based on the input token.

MoE Mechanisms

What is sparsity?
Sparsity builds on the idea of conditional computation. While in dense models all the parameters are used for every input, sparsity allows us to run only some parts of the whole system. Leveraging conditional computation:
→ allows scaling the size of the model without increasing the computation.
→ allows thousands of experts to be used in each MoE layer.

The gating function
With input x, a learned gating network G decides which experts E to send the input to:
    y = sum_i G(x)_i * E_i(x)
What if G(x)_i equals 0?
→ There is no need to compute the respective expert, which saves computational cost.
So, how do we define the gating function?
In the most traditional setup, we just use a simple network with a softmax function, and the network learns which expert to send the input to:
    G(x) = Softmax(x · W_g)
where W_g is the learnable parameter matrix for gating to choose the expert.

Noisy Top-k Gating
Another gating approach, proposed in Shazeer et al. [5], is Noisy Top-k Gating. It introduces some (tunable) noise and then keeps only the top k values:
Step 1: H(x)_i = (x · W_g)_i + StandardNormal() · Softplus((x · W_noise)_i)
Step 2: KeepTopK(v, k)_i = v_i if v_i is among the top k elements of v, and -∞ otherwise
Step 3: G(x) = Softmax(KeepTopK(H(x), k))
By using a low enough k (e.g. one or two), we can train and run inference much faster than if many experts were activated.
→ Are there other strategies? Yes, several exist.
(Figure: Visualization of six different routing algorithms. Retrieved from [6].)

Load balancing problems in MoEs
In MoEs, the effective batch size per expert matters, because only the active experts receive tokens. For example, if our batched input consists of 10 tokens:
- 5 tokens might end up in one expert, and
- the other 5 tokens might end up in five different experts.
→ This leads to uneven batch sizes and underutilization.
→ The gating network converges to mostly activating the same few experts.
→ This self-reinforces, as the favored experts are trained more quickly and hence selected more often.
→ HOW TO RESOLVE THIS?

1. Recall Noisy Top-k Gating: the added noise helps spread tokens across experts.
2. An auxiliary loss is added to encourage giving all experts equal importance (see the code sketch after this list). This loss ensures that all experts receive a roughly equal number of training examples.
3. Expert capacity introduces a threshold on how many tokens can be processed by a single expert. For example, Switch Transformers [3] also explores the concept of expert capacity.
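Below is a minimal PyTorch sketch combining the noisy top-k gate (Steps 1-3 above) with a Switch-style auxiliary load-balancing loss [3]. It follows the equations from [5] but is not code from either paper; the name NoisyTopKRouter and all hyperparameters are illustrative, and the loss coefficient is fixed to 1 for simplicity.

```python
# Sketch of Noisy Top-k Gating [5] plus a Switch-style auxiliary
# load-balancing loss [3]. Variable names are illustrative, not from the papers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    def __init__(self, d_model=16, num_experts=8, k=2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)   # W_g in Step 1
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)  # W_noise in Step 1
        self.k = k
        self.num_experts = num_experts

    def forward(self, x):                                 # x: (num_tokens, d_model)
        clean = self.w_gate(x)                            # (x · W_g)
        noise_std = F.softplus(self.w_noise(x))           # Softplus(x · W_noise)
        h = clean + torch.randn_like(clean) * noise_std   # Step 1: H(x)
        # Step 2: keep the top-k logits per token, set the rest to -inf
        topk_val, topk_idx = h.topk(self.k, dim=-1)
        masked = torch.full_like(h, float("-inf")).scatter(-1, topk_idx, topk_val)
        gates = F.softmax(masked, dim=-1)                 # Step 3: G(x), sparse

        # Auxiliary loss: fraction of tokens whose top-1 choice is expert i,
        # times the mean router probability assigned to expert i, summed.
        probs = F.softmax(h, dim=-1)
        frac_tokens = F.one_hot(topk_idx[:, 0], self.num_experts).float().mean(dim=0)
        frac_probs = probs.mean(dim=0)
        aux_loss = self.num_experts * torch.sum(frac_tokens * frac_probs)
        return gates, aux_loss

router = NoisyTopKRouter()
gates, aux_loss = router(torch.randn(10, 16))   # 10 tokens of width 16
print(gates.shape, aux_loss.item())             # torch.Size([10, 8]) and a scalar
```

During training, aux_loss would be added to the task loss (scaled by a small coefficient) so that the router is pushed toward spreading tokens roughly evenly across experts.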
Discussions

When to use sparse MoEs vs dense models?
Experts are useful for high-throughput scenarios with many machines. Given a fixed compute budget for pretraining, a sparse model will be more optimal. For low-throughput scenarios with little VRAM, a dense model will be better.

Interpretability
Sparse expert models more naturally lend themselves to interpretability studies, because each input is processed by an identifiable, discrete subset of the model weights (i.e. the chosen experts). Therefore, instead of the daunting task of interpreting possibly trillions of floating-point numbers, one can instead read off a small discrete set of integers corresponding to which experts the input was sent to.
(Figure: Expert interpretability. Retrieved from [6].)

References
[1] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79-87.
[2] Zian Wang (2023). Mixture of Experts: How an Ensemble of AI Models Decide As One. https://deepgram.com/learn/mixture-of-experts-ml-model-guide
[3] Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1-39.
[4] Eigen, D., Ranzato, M. A., & Sutskever, I. (2013). Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314.
[5] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
[6] Fedus, W., Dean, J., & Zoph, B. (2022). A review of sparse expert models in deep learning. arXiv preprint arXiv:2209.01667.