Mixture of Experts (MoE)
**Mixture of Experts (MoE)** is a neural network architecture where each input is routed to a small subset of specialized "expert" sub-networks. This enables models with trillions of parameters while only activating a fraction of them for any given token.
How It Works
- Input arrives at a router/gating network
- Router selects which experts to activate (e.g., 8 out of 384)
- Selected experts process the input
- Outputs are combined into the final result (a minimal routing sketch follows below)
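To make the flow concrete, here is a minimal sketch of top-k routing in PyTorch. The `TopKMoE` name, the dimensions, and the 16-expert/top-2 configuration are illustrative assumptions, not the production-scale numbers discussed below.

```python
# Minimal top-k MoE layer: a linear router scores experts per token,
# the top-k experts run, and their outputs are mixed by the gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)        # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (tokens, d_model)
        logits = self.router(x)                              # (tokens, num_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)    # pick k experts per token
        weights = F.softmax(weights, dim=-1)                 # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                           # simple loop, not optimized
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e                     # tokens routed to expert e in this slot
                if mask.any():
                    w = weights[mask, slot].unsqueeze(1)
                    out[mask] += w * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(TopKMoE()(tokens).shape)   # torch.Size([10, 64])
```

The per-expert Python loop keeps the sketch readable; production systems replace it with batched dispatch/gather kernels so that each expert processes its tokens in one pass.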
Example: Kimi K2
- Total parameters: 1 trillion
- Activated per token: 32 billion
- Number of experts: 384
- Experts selected per token: 8
This means Kimi K2 has 1T parameter capacity but only uses 32B parameters per forward pass, making it computationally efficient.
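A quick back-of-the-envelope check of those figures:

```python
# Active fraction per token for Kimi K2, using the numbers quoted above.
total_params = 1_000_000_000_000      # 1T total
active_params = 32_000_000_000        # 32B active
experts_total, experts_active = 384, 8

print(f"experts active:    {experts_active / experts_total:.1%}")   # ~2.1%
print(f"parameters active: {active_params / total_params:.1%}")     # 3.2%
```

The parameter fraction (3.2%) is a bit higher than the expert fraction (~2.1%), which is expected: dense components such as attention layers, embeddings, and any shared experts run for every token regardless of routing.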
Advantages
- Efficient scaling: Massive model capacity without proportionally higher compute costs
- Specialization: Different experts can handle different types of inputs
- Cost-effective inference: Only pay compute for activated parameters
Disadvantages
- Complex training: Requires careful load balancing across experts (a common mitigation is sketched after this list)
- Memory requirements: All parameters must still be stored in memory, even though only a few experts run per token
- Routing instability: Poor routing can concentrate load on a few experts
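One widely used mitigation is an auxiliary load-balancing loss added to the training objective. The sketch below follows the Switch Transformer-style formulation; the function name and the top-k assumption are illustrative, not any specific model's implementation.

```python
# Auxiliary load-balancing loss: penalizes the router when token assignments
# concentrate on a few experts. `router_logits` has shape (tokens, num_experts).
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                  # (tokens, num_experts)
    topk_idx = probs.topk(k, dim=-1).indices                  # chosen experts per token
    # Fraction of token-slots dispatched to each expert
    dispatch = F.one_hot(topk_idx, num_experts).float().sum(dim=(0, 1))
    dispatch = dispatch / dispatch.sum()
    # Mean router probability assigned to each expert
    importance = probs.mean(dim=0)
    # Minimized (~1.0) when both distributions are uniform across experts
    return num_experts * torch.sum(dispatch * importance)

logits = torch.randn(128, 16)                                 # 128 tokens, 16 experts
print(load_balancing_loss(logits))
```

The loss bottoms out near 1.0 when tokens are spread evenly and grows as traffic concentrates, so adding it to the main objective (scaled by a small coefficient) nudges the router toward balanced expert usage.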
Notable MoE Models
- Kimi K2 (1T total, 32B active)
- DeepSeek V3 (671B total)
- Mixtral (8x7B, 8x22B)
- GPT-4 (rumored)