A short introduction to Mixture-of-Experts (MoE)
Photo by Michael Fousert
Today, I want to talk about something I’ve been discussing a lot on Discord lately: Mixture-of-Experts (MoE). I believe MoE is the future of Large Language Models for several reasons, and I have been talking about that for a while now. I was quite happy when I saw that Mistral released their new model, Mixtral 8x7B, using this architecture. I won’t talk too much about the model itself, because we don’t have a lot of information about it, but I will talk about the architecture and why I think it’s the future of large language models. So, grab a cup of coffee and let’s get started.
The release of Mixtral 8x7B
On Friday, December 8th, 2023, the French start-up Mistral released their new AI model in an unusual way: as a magnet link posted on X. Their newest language model is called Mixtral 8x7B.
Mixtral is a sparse mixture-of-experts model. OpenAI is rumored to use a similar architecture for GPT-4, at a much larger scale. Mixtral has a total of 46.7 billion parameters, but it only uses about 12.9 billion of them for each token.
Official benchmarks show that Mixtral outperforms Llama 2 70B on most evaluations and matches or outperforms GPT-3.5 on standard benchmarks. Vercel has integrated Mixtral 8x7B into their SDK, allowing broader access for testing and implementation.
The architecture of Mixtral 8x7B is a sparse mixture-of-experts network. It’s a decoder-only model where the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combines their outputs additively. The design resembles the Switch Transformer, although the Switch Transformer routes each token to a single expert rather than two.
Because only 2 of the 8 experts run for each token, the compute cost per token is much lower than the total parameter count suggests, which makes Mixtral cheaper and faster to run than a dense model of the same size.
{
    "dim": 4096,
    "n_layers": 32,
    "head_dim": 128,
    "hidden_dim": 14336,
    "n_heads": 32,
    "n_kv_heads": 8,
    "norm_eps": 1e-05,
    "vocab_size": 32000,
    "moe": {
        "num_experts_per_tok": 2,
        "num_experts": 8
    }
}
This advanced setup, with a high-dimensional embedding space (dim: 4096), multiple layers (n_layers: 32), and numerous heads for attention mechanisms (n_heads: 32), highlights the MoE architecture’s focus on efficient, specialized token processing.
Regarding its size and scalability, Mixtral 8x7B holds 46.7 billion parameters in total but only activates a fraction of them for each token. The model supports a 32K-token context window, the same size as the 32K variant of GPT-4.
One of the most striking features of Mixtral 8x7B is its ability to offer high-level AI capabilities while being more accessible in terms of usability and computational resource requirements. The model’s approach to parameter utilization and efficiency sets a new standard in the field of AI.
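To make the two parameter figures less abstract, here is a rough sketch that rebuilds them from the config above. It rests on assumptions that are not stated in the config itself: each expert is a gated feed-forward block with three weight matrices, each layer has two normalization vectors, and the input embedding and output projection are not tied. The struct and function names are mine, not Mistral's.

```rust
/// The fields of the published config that matter for counting parameters.
struct Config {
    dim: u64,
    n_layers: u64,
    head_dim: u64,
    hidden_dim: u64,
    n_heads: u64,
    n_kv_heads: u64,
    vocab_size: u64,
    num_experts: u64,
    num_experts_per_tok: u64,
}

const MIXTRAL: Config = Config {
    dim: 4096,
    n_layers: 32,
    head_dim: 128,
    hidden_dim: 14336,
    n_heads: 32,
    n_kv_heads: 8,
    vocab_size: 32000,
    num_experts: 8,
    num_experts_per_tok: 2,
};

/// Counts parameters when `experts_used` experts per layer are included.
fn count_params(c: &Config, experts_used: u64) -> u64 {
    // Query and output projections use all heads; key and value projections
    // use the smaller number of KV heads.
    let attention =
        2 * c.dim * c.n_heads * c.head_dim + 2 * c.dim * c.n_kv_heads * c.head_dim;
    // Assumption: three weight matrices per expert (gated feed-forward block).
    let expert = 3 * c.dim * c.hidden_dim;
    // The router is one linear map from the hidden state to one score per expert.
    let router = c.dim * c.num_experts;
    // Assumption: two normalization vectors per layer.
    let norms = 2 * c.dim;
    let per_layer = attention + experts_used * expert + router + norms;
    // Assumption: untied embedding and output projection, plus a final norm.
    let embeddings = 2 * c.vocab_size * c.dim + c.dim;
    c.n_layers * per_layer + embeddings
}

fn main() {
    let total = count_params(&MIXTRAL, MIXTRAL.num_experts);
    let active = count_params(&MIXTRAL, MIXTRAL.num_experts_per_tok);
    println!("total:  {:.1}B", total as f64 / 1e9);
    println!("active: {:.1}B", active as f64 / 1e9);
}
```

Under those assumptions the sketch lands on 46.7B total and 12.9B active parameters. It also shows why the name "8x7B" is a little misleading: the attention weights and embeddings are shared, and only the feed-forward experts are replicated.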
Defining MoE
The Mixture-of-Experts (MoE) model stands as a paradigm shift in neural network architecture. It adopts a modular approach, breaking down a complex problem into smaller sub-problems, each addressed by a specialized ‘expert’—a smaller neural network fine-tuned on specific data aspects or subsets. This modular design markedly differs from traditional, monolithic neural network structures.
The essence of MoE lies in ensemble learning, a concept that harnesses the collective strength of multiple ‘weak’ learners to form a more robust, ‘strong’ learner.
Although the exact origin of the MoE concept is unclear, its significant advancement was marked by Google’s publication (Shazeer et al., 2017). The paper introduced new dimensions to MoE’s understanding and application.
The intricate details of MoE’s implementation vary across different architectures as presented in various studies. However, the central principle remains consistent: the presence of a gating network. This network plays a crucial role in determining which expert is best suited to process a given input.
Significant research has also been dedicated to refining MoE’s routing mechanisms, that is, the process of selecting the appropriate expert for a specific input. A commonly used method is softmax gating, which applies a softmax function to the scores the gating network assigns to each expert, turning them into relative weights for the given input.
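As a small illustration of softmax gating, the sketch below turns a set of raw router scores into weights that sum to one. The scores are made up for the example; in a real model they come from a learned layer.

```rust
/// Numerically stable softmax over the router's raw scores (logits).
fn softmax(logits: &[f64]) -> Vec<f64> {
    // Subtracting the maximum avoids overflow in exp() without changing the result.
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    // Hypothetical router scores for four experts on a single token.
    let logits = [2.0, 0.5, -1.0, 0.5];
    let probs = softmax(&logits);
    for (i, p) in probs.iter().enumerate() {
        println!("expert {}: {:.3}", i, p);
    }
}
```

The expert with the highest score gets the largest weight, and experts with equal scores get equal weights.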
For those interested in exploring MoE in greater depth, including its diverse implementations, the end of this article provides a comprehensive resource section.
Core Components of MoE
Image source: Shazeer et al., 2017
- Experts: These are specialized neural networks, each trained to handle specific aspects of the overall task. Their specialization allows for exceptional performance in their respective areas.
- Gating Network: A pivotal component of the MoE architecture, the gating network acts as a dynamic selector, determining the most suitable expert or combination of experts for each input. This selection is based on the specific characteristics of the input and the areas of expertise of each network.
To better understand these abstract concepts, consider a real-world analogy involving medical specialists.
Imagine you’re a general practitioner faced with a patient suffering from a complex ailment. You recognize the need for a specialist, but the question is: which one? In this scenario, you—the general practitioner—act like the gating network, assessing the patient’s symptoms to determine the right specialist. For heart-related issues, you would refer to a cardiologist; for neurological concerns, a neurologist; and for lung problems, a pulmonologist. This process mirrors how the gating network in an MoE model selects the appropriate expert based on the given input.
To illustrate this concept with a simplified code example in Rust (though MoE models are far more complex), consider the following:
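// Two toy "experts", each applying a different transformation.
fn expert1(input: i32) -> i32 {
    input * 2
}

fn expert2(input: i32) -> i32 {
    input + 10
}

// The gate picks one expert based on the input:
// even numbers go to expert1, odd numbers go to expert2.
fn gating_function(input: i32) -> fn(i32) -> i32 {
    if input % 2 == 0 {
        expert1
    } else {
        expert2
    }
}

fn main() {
    let input = 5;
    let chosen_expert = gating_function(input);
    let result = chosen_expert(input);
    // 5 is odd, so expert2 runs and this prints "Result: 15".
    println!("Result: {}", result);
}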
While this example simplifies the MoE concept, it encapsulates the basic mechanism of how such a model operates.
How Does MoE Work?
- Input Reception: The MoE model receives an input, much like any standard neural network.
- Gating Mechanism Activation: The gating network evaluates the input and assigns weights to each expert, indicating their relevance for the current input.
- Expert Processing: Based on the gating decision, one or more experts process the input. This processing can happen in parallel, offering computational efficiency.
- Output Synthesis: The outputs from the activated experts are then aggregated. This aggregation is typically a weighted sum, guided by the gating network’s output.
- Final Output: The synthesized output is then presented as the model’s response to the input.
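The steps above can be sketched in a few dozen lines of Rust. This is a toy under stated assumptions, not how any particular model implements it: the experts are plain functions standing in for feed-forward networks, the gate is a hypothetical linear layer with one weight row per expert, and the weights are a softmax taken over the selected experts only, which is one common choice among several.

```rust
/// An expert is modeled as a plain function on a vector (a stand-in for a
/// feed-forward network).
type Expert = fn(&[f64]) -> Vec<f64>;

fn double(x: &[f64]) -> Vec<f64> { x.iter().map(|v| v * 2.0).collect() }
fn add_ten(x: &[f64]) -> Vec<f64> { x.iter().map(|v| v + 10.0).collect() }
fn negate(x: &[f64]) -> Vec<f64> { x.iter().map(|v| -v).collect() }

/// Gating: score each expert as the dot product of its gate row with the input.
fn gate_scores(input: &[f64], gate: &[Vec<f64>]) -> Vec<f64> {
    gate.iter()
        .map(|row| row.iter().zip(input).map(|(w, x)| w * x).sum::<f64>())
        .collect()
}

/// Keeps the k highest-scoring experts and applies a softmax over their scores.
/// Assumes k >= 1 and at least one expert.
fn top_k_weights(scores: &[f64], k: usize) -> Vec<(usize, f64)> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a])); // descending
    idx.truncate(k);
    let max = scores[idx[0]];
    let exps: Vec<f64> = idx.iter().map(|&i| (scores[i] - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    idx.into_iter().zip(exps).map(|(i, e)| (i, e / sum)).collect()
}

/// Expert processing and output synthesis: run only the selected experts and
/// return the weighted sum of their outputs.
fn moe_forward(input: &[f64], gate: &[Vec<f64>], experts: &[Expert], k: usize) -> Vec<f64> {
    let scores = gate_scores(input, gate);
    let mut output = vec![0.0; input.len()];
    for (i, w) in top_k_weights(&scores, k) {
        for (o, y) in output.iter_mut().zip(experts[i](input)) {
            *o += w * y;
        }
    }
    output
}

fn main() {
    let experts: [Expert; 3] = [double, add_ten, negate];
    // Hypothetical gate weights: one row per expert.
    let gate = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![-1.0, -1.0]];
    let input = [3.0, 1.0];
    println!("{:?}", moe_forward(&input, &gate, &experts, 2));
}
```

With k set to 2, the third expert is never called for this input, which is the whole point of sparse activation: its parameters exist but cost nothing for this token.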
Advantages
- Faster Inference with Sparse Activation: MoEs can potentially offer faster inference compared to traditional dense models. This is due to their architecture where only a subset of ‘experts’ are activated for a given input. For instance, if n=1, it means only the most relevant expert is activated for each token. This approach reduces the computational load per token, potentially speeding up inference.
- Expertise and Specialization: By having dedicated experts, the model achieves a higher degree of specialization, potentially improving performance on specific tasks.
- Adaptability: Adding or modifying experts allows the MoE model to adapt to new tasks or changes in data distribution more easily than traditional models.
- Distillation into Smaller Models: An interesting aspect of MoEs is the possibility of distilling them into smaller, dense models. This process can potentially retain some of the quality gains achieved from the larger MoE model. This approach offers a balance between the benefits of fast pre-training (using large MoE models) and efficient inference (using smaller, distilled models).
Challenges
- Training Complexity: Coordinating multiple experts and a gating network increases the complexity of training MoE models.
- Risk of Overfitting: There’s a risk that experts might overfit to their specific sub-tasks, reducing the model’s generalizability.
- Expert Balancing: Ensuring that the workload and importance are evenly distributed among experts is a challenging aspect of MoE.
- High VRAM Requirements: Despite the potential for faster inference, MoEs are known for their high memory requirements. This is because all the experts in the model need to be loaded into VRAM, even though only a few are activated at a time.
- Performance in Distributed vs. Local Environments: MoEs tend to perform better in distributed computing environments where the workload can be spread across many machines. This is in contrast to local inference scenarios (e.g., running on a single machine), where the high VRAM requirement and complex orchestration of experts can be limiting.
- Pre-training vs. Fine-tuning: MoEs tend to reach better quality than dense models during pre-training within a fixed computational budget, but may not do as well during fine-tuning. I believe this is due to the nature of how experts are specialized and how they generalize to new tasks or datasets during the fine-tuning phase.
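The expert balancing problem mentioned above is often tackled with an auxiliary loss added during training. The sketch below follows the form used in the Switch Transformers paper listed in the sources: alpha times the number of experts times the sum over experts of f_i * P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. The function name, the toy probabilities, and the alpha value are my own choices for illustration.

```rust
/// Auxiliary load-balancing loss: alpha * N * sum_i(f_i * P_i).
/// `router_probs[t][i]` is the router probability of expert i for token t.
/// Assumes at least one token and the same number of experts for every token.
fn load_balancing_loss(router_probs: &[Vec<f64>], alpha: f64) -> f64 {
    let num_tokens = router_probs.len() as f64;
    let num_experts = router_probs[0].len();
    let mut token_fraction = vec![0.0; num_experts]; // f_i
    let mut mean_prob = vec![0.0; num_experts]; // P_i
    for probs in router_probs {
        // Top-1 routing: the token goes to its highest-probability expert.
        let mut chosen = 0;
        for (i, &p) in probs.iter().enumerate() {
            if p > probs[chosen] {
                chosen = i;
            }
            mean_prob[i] += p / num_tokens;
        }
        token_fraction[chosen] += 1.0 / num_tokens;
    }
    let dot: f64 = token_fraction
        .iter()
        .zip(&mean_prob)
        .map(|(f, p)| f * p)
        .sum();
    alpha * num_experts as f64 * dot
}

fn main() {
    let alpha = 0.01; // hypothetical coefficient
    // Tokens split evenly between two experts.
    let balanced = vec![vec![0.9, 0.1], vec![0.1, 0.9], vec![0.9, 0.1], vec![0.1, 0.9]];
    // Every token prefers the first expert.
    let collapsed = vec![vec![0.9, 0.1]; 4];
    println!("balanced:  {:.4}", load_balancing_loss(&balanced, alpha));
    println!("collapsed: {:.4}", load_balancing_loss(&collapsed, alpha));
}
```

The loss is lowest when tokens and router probability are spread evenly, so minimizing it nudges the router away from sending everything to a few favorite experts.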
MoE in the Future
Emerging research focuses on enhancing MoE’s efficiency and scalability. Innovations like dynamically activated networks, which activate only a fraction of the model at any time, are making MoE models more viable and efficient.
However, it’s important to note that MoE models are not ideal for running on consumer devices. The reason is that they are very large and every expert has to sit in memory, even though only a few are used for each token. This is why I think that MoE models will mostly be used in the cloud and not on consumer devices. Locally, you’re usually memory constrained. To maximize capabilities you should use a big dense model that maxes out device memory.
MoE makes perfect sense for serving many users though, because you can do expert parallelism across many devices and achieve fewer FLOPs per request at a similar capability level (in theory) as a larger dense model, saving you money. I expect that in the future we might have an MoE model that is optimised to run locally, but this is not currently the case.
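To put rough numbers on the memory side of this trade-off, here is a back-of-the-envelope sketch using the Mixtral figures quoted earlier. It only counts the weights and ignores the KV cache, activations, and any runtime overhead, and the helper name is mine.

```rust
/// Memory needed to hold `params` weights at a given precision, in gigabytes
/// (1 GB = 1e9 bytes).
fn weight_memory_gb(params: u64, bits_per_param: u32) -> f64 {
    params as f64 * bits_per_param as f64 / 8.0 / 1e9
}

fn main() {
    let total: u64 = 46_700_000_000; // all experts must be resident
    let active: u64 = 12_900_000_000; // weights actually used per token
    for bits in [16u32, 8, 4] {
        println!(
            "{:>2}-bit: {:.1} GB resident, {:.1} GB used per token",
            bits,
            weight_memory_gb(total, bits),
            weight_memory_gb(active, bits)
        );
    }
}
```

The gap between the two columns is the price of sparsity on a single machine: you pay memory for the full model while only getting the compute of the active part on each token.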
Conclusion
Mixture-of-Experts models represent a paradigm shift in deep learning, offering a unique approach to tackling large-scale, complex tasks. Their ability to divide and conquer, leveraging specialized expertise, holds the potential to drive significant advancements in AI. As the field progresses, I anticipate seeing MoE models becoming integral to solving some of the most challenging problems in AI.
Sources and Further Reading:
- Fedus, W., et al. (2021). “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” arXiv preprint arXiv:2101.03961.
- Zoph, B., et al. (2022). “Designing Effective Sparse Expert Models.” arXiv preprint arXiv:2202.08906v1.
- Du, N., et al. (2021). “GLaM: Efficient Scaling of Language Models with Mixture-of-Experts.”
- Shazeer, N., et al. (2017). “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” arXiv preprint arXiv:1701.06538.
- Bengio, Y., et al. (2013). “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation.”
- https://www.artfintel.com/p/papers-ive-read-this-week-mixture?nthPub=201
- https://www.artfintel.com/p/more-on-mixture-of-experts-models
- https://lilianweng.github.io/posts/2021-09-25-train-large/#mixture-of-experts-moe
- https://www.youtube.com/playlist?list=PLvtrkEledFjoTA9cYo_wX6aG2WT5RFBY9