
Mixture of Experts (MoE)

**Mixture of Experts (MoE)** is a neural network architecture in which each input is routed to a small subset of specialized "expert" sub-networks. This enables models with trillions of parameters while only activating a fraction of them for any given token.

How It Works

  1. Input arrives at a router/gating network
  2. Router selects which experts to activate (e.g., 8 out of 384)
  3. Selected experts process the input
  4. Outputs are combined into the final result (a minimal code sketch of these steps follows below)
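The four steps above map directly onto code. Below is a minimal PyTorch sketch of a top-k MoE layer with toy dimensions; the layer sizes, expert count, and top-k value are illustrative and not taken from any particular model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-k Mixture of Experts layer (illustrative sketch, not production code)."""

    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        # 1. Router/gating network: scores every expert for every token
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is a small independent feed-forward network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                      # x: (num_tokens, d_model)
        scores = self.router(x)                # router scores, (num_tokens, num_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # 2. select top-k experts
        gates = F.softmax(top_scores, dim=-1)  # normalized weights for the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):         # 3. each chosen expert processes its tokens
            for e in top_idx[:, slot].unique().tolist():
                mask = top_idx[:, slot] == e
                out[mask] += gates[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out                             # 4. weighted combination is the final output


# Usage: 10 tokens routed through 8 experts, 2 active per token
layer = MoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)   # torch.Size([10, 64])
```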

Example: Kimi K2

  • Total parameters: 1 trillion
  • Activated per token: 32 billion
  • Number of experts: 384
  • Experts selected per token: 8

This means Kimi K2 has 1T parameters of total capacity but activates only about 32B per token on each forward pass, making inference far cheaper than running a dense 1T-parameter model.
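As a quick sanity check, the fraction of parameters touched per token follows directly from the figures above (the compute comparison in the comments is a rough approximation, not a benchmark):

```python
# Back-of-the-envelope numbers restating the Kimi K2 figures above (illustrative)
total_params = 1_000_000_000_000      # 1T total parameters
active_params = 32_000_000_000        # 32B activated per token
num_experts, experts_per_token = 384, 8

print(f"Active fraction per token: {active_params / total_params:.1%}")   # 3.2%
print(f"Experts active per token:  {experts_per_token}/{num_experts}")

# Per-token compute (FLOPs) scales roughly with the *active* parameters,
# so a forward pass costs about as much as a ~32B dense model,
# while total capacity (and memory footprint) remains 1T.
```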

Advantages

  • Scale efficiently: Massive model capacity without proportional compute costs
  • Specialization: Different experts can handle different types of inputs
  • Cost-effective inference: Only pay compute for activated parameters

Disadvantages

  • Complex training: Requires careful load balancing across experts (see the auxiliary-loss sketch after this list)
  • Memory requirements: Still need to store all parameters
  • Routing instability: Poor routing can concentrate load on few experts
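One widely used mitigation for the load-balancing and routing-instability problems is an auxiliary loss in the style of Switch Transformer/GShard that penalizes uneven expert usage. The sketch below assumes a simple top-k router and is illustrative rather than the exact scheme used by any specific model:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_indices, num_experts):
    """Switch-Transformer-style auxiliary loss (sketch).

    Pushes the router toward a uniform distribution of tokens over experts,
    which mitigates the "few hot experts" failure mode.

    router_logits:  (num_tokens, num_experts) raw router scores
    expert_indices: (num_tokens, top_k) experts selected per token
    """
    probs = F.softmax(router_logits, dim=-1)                   # router probabilities
    one_hot = F.one_hot(expert_indices, num_experts).float()   # (tokens, top_k, experts)
    # f_i: fraction of routing assignments that went to expert i
    load = one_hot.sum(dim=(0, 1)) / expert_indices.numel()
    # P_i: mean router probability assigned to expert i
    importance = probs.mean(dim=0)
    # Minimized when both are uniform, i.e. every expert gets 1/num_experts
    return num_experts * torch.sum(load * importance)


# Usage: add to the main loss with a small coefficient, e.g. loss += 0.01 * aux
logits = torch.randn(32, 8)                   # 32 tokens, 8 experts
_, idx = logits.topk(2, dim=-1)               # top-2 routing
print(load_balancing_loss(logits, idx, 8))
```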

Notable MoE Models

  • Kimi K2 (1T total, 32B active)
  • DeepSeek V3 (671B total, 37B active)
  • Mixtral (8x7B, 8x22B)
  • GPT-4 (rumored)
