Mixture of Experts (MoE)
**Mixture of Experts (MoE)** is a neural network architecture where each input is routed to a small subset of specialized "expert" sub-networks. This enables models with trillions of parameters while only activating a fraction of them for any given token.
How It Works
- Input arrives at a router/gating network
- Router selects which experts to activate (e.g., 8 out of 384)
- Selected experts process the input
- Outputs are combined into the final result (a minimal routing sketch follows below)
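To make the flow concrete, here is a minimal sketch of top-k routing in PyTorch. The `TopKMoE` name, the dimensions, and the 16-expert/top-2 configuration are illustrative assumptions, not the production-scale numbers discussed below.

```python
# Minimal top-k MoE layer: a linear router scores experts per token,
# the top-k experts run, and their outputs are mixed by the gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)        # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (tokens, d_model)
        logits = self.router(x)                              # (tokens, num_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)    # pick k experts per token
        weights = F.softmax(weights, dim=-1)                 # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                           # simple loop, not optimized
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e                     # tokens routed to expert e in this slot
                if mask.any():
                    w = weights[mask, slot].unsqueeze(1)
                    out[mask] += w * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(TopKMoE()(tokens).shape)   # torch.Size([10, 64])
```

The per-expert Python loop keeps the sketch readable; production systems replace it with batched dispatch/gather kernels so that each expert processes its tokens in one pass.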
Example: Kimi K2
- Total parameters: 1 trillion
- Activated per token: 32 billion
- Number of experts: 384
- Experts selected per token: 8
This means Kimi K2 has 1T parameter capacity but only uses 32B parameters per forward pass, making it computationally efficient.
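A quick back-of-the-envelope check of those figures:

```python
# Active fraction per token for Kimi K2, using the numbers quoted above.
total_params = 1_000_000_000_000      # 1T total
active_params = 32_000_000_000        # 32B active
experts_total, experts_active = 384, 8

print(f"experts active:    {experts_active / experts_total:.1%}")   # ~2.1%
print(f"parameters active: {active_params / total_params:.1%}")     # 3.2%
```

The parameter fraction (3.2%) is a bit higher than the expert fraction (~2.1%), which is expected: dense components such as attention layers, embeddings, and any shared experts run for every token regardless of routing.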
Advantages
- Efficient scaling: Massive model capacity without proportionally higher compute costs
- Specialization: Different experts can handle different types of inputs
- Cost-effective inference: Only pay compute for activated parameters
Disadvantages
- Complex training: Requires careful load balancing across experts (a common mitigation is sketched after this list)
- Memory requirements: All parameters must still be stored in memory, even though only a few experts run per token
- Routing instability: Poor routing can concentrate load on a few experts
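One widely used mitigation is an auxiliary load-balancing loss added to the training objective. The sketch below follows the Switch Transformer-style formulation; the function name and the top-k assumption are illustrative, not any specific model's implementation.

```python
# Auxiliary load-balancing loss: penalizes the router when token assignments
# concentrate on a few experts. `router_logits` has shape (tokens, num_experts).
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                  # (tokens, num_experts)
    topk_idx = probs.topk(k, dim=-1).indices                  # chosen experts per token
    # Fraction of token-slots dispatched to each expert
    dispatch = F.one_hot(topk_idx, num_experts).float().sum(dim=(0, 1))
    dispatch = dispatch / dispatch.sum()
    # Mean router probability assigned to each expert
    importance = probs.mean(dim=0)
    # Minimized (~1.0) when both distributions are uniform across experts
    return num_experts * torch.sum(dispatch * importance)

logits = torch.randn(128, 16)                                 # 128 tokens, 16 experts
print(load_balancing_loss(logits))
```

The loss bottoms out near 1.0 when tokens are spread evenly and grows as traffic concentrates, so adding it to the main objective (scaled by a small coefficient) nudges the router toward balanced expert usage.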
Notable MoE Models
- Kimi K2 (1T total, 32B active)
- DeepSeek V3 (671B total)
- Mixtral (8x7B, 8x22B)
- GPT-4 (rumored)