Inference
The process of using a trained AI model to make predictions or generate outputs from new inputs.
What is inference?
Inference is running a trained AI model to get outputs from new inputs. It's the "using" phase, as opposed to the "learning" phase (training).
Training:
- Feed massive datasets to the model
- Adjust weights to minimize errors
- Computationally expensive
- Done once (or occasionally)
Inference:
- Feed new data to the trained model
- Model produces predictions/outputs
- Much faster than training
- Done every time you use the model
When you chat with ChatGPT, that's inference. Every response is the model running inference on your prompt plus conversation history.
How does inference work?
For a language model, the steps are (sketched in code after this list):
- Tokenize: Convert your text input to tokens
- Embed: Convert tokens to numerical vectors
- Process: Run vectors through model layers
- Generate: Produce output token by token
- Decode: Convert output tokens back to text
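A minimal sketch of these steps in Python, using the Hugging Face `transformers` library and GPT-2 purely as illustrative choices (not part of the definition above):

```python
# Sketch of the inference pipeline with Hugging Face transformers (illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # any causal LM would work here
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# 1-2. Tokenize: text -> token IDs (the embedding step happens inside the model)
inputs = tokenizer("Inference is", return_tensors="pt")

# 3-4. Process and generate: run the layers and produce output tokens
output_ids = model.generate(**inputs, max_new_tokens=20)

# 5. Decode: token IDs -> text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```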
Autoregressive generation: LLMs generate text one token at a time. Each new token is appended to the input and used to generate the next. A 100-word response is roughly 130 tokens in English, so it takes well over a hundred sequential inference steps.
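To make the loop concrete, here is a sketch of greedy autoregressive decoding; the model and library are assumptions for illustration:

```python
# Greedy autoregressive decoding: each new token is appended and fed back in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):                                          # one forward pass per new token
        logits = model(input_ids).logits                         # (batch, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)      # becomes input for the next step

print(tokenizer.decode(input_ids[0]))
```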
Batching: To improve efficiency, serving systems process multiple requests simultaneously in a single forward pass, sharing fixed overhead and improving hardware utilization (as sketched below).
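A sketch of what batching can look like at the library level, assuming the same `transformers` setup as above; real serving systems batch requests dynamically across users:

```python
# Batched inference: several prompts share one forward pass per decoding step.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
tokenizer.padding_side = "left"                  # left-pad so generation continues from real text
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["Inference is", "Training is", "Batching helps because"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**batch, max_new_tokens=15, pad_token_id=tokenizer.eos_token_id)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```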
KV caching: Models cache the attention keys and values computed for earlier tokens, so each new step only computes over the latest token instead of reprocessing the whole sequence.
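A rough sketch of the cache at the API level, using `transformers`' `past_key_values`; production serving stacks manage this internally, so treat the details as illustrative:

```python
# KV caching: reuse attention keys/values so each step only processes the new token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("KV caching avoids", return_tensors="pt").input_ids
generated = []
with torch.no_grad():
    out = model(input_ids, use_cache=True)                 # process the full prompt once
    for _ in range(8):
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        # Only the newest token is fed in; earlier keys/values come from the cache.
        out = model(next_id, past_key_values=out.past_key_values, use_cache=True)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```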
Inference costs and pricing
Why inference costs matter: While training is a one-time cost, inference happens with every use. High-volume applications can have significant inference costs.
Cost factors:
- Model size: Larger models = more computation = higher cost
- Input length: More tokens to process = higher cost
- Output length: More tokens to generate = higher cost
- Speed requirements: Faster inference often costs more
Typical pricing (API-based): Charged per token, with different rates for input and output (a worked cost example follows the list):
- GPT-4o: ~$0.0025/1K input, ~$0.01/1K output tokens
- Claude 3.5 Sonnet: ~$0.003/1K input, ~$0.015/1K output tokens
- Open-source hosted: Often 10-50% cheaper
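A quick back-of-the-envelope calculation using the GPT-4o rates above; prices change frequently, and the token counts here are assumptions:

```python
# Rough cost estimate at ~$0.0025 per 1K input tokens and ~$0.01 per 1K output tokens.
input_tokens, output_tokens = 1_500, 500          # e.g., prompt plus chat history, short reply
cost = input_tokens / 1000 * 0.0025 + output_tokens / 1000 * 0.01
print(f"~${cost:.4f} per request")                         # ~$0.0088
print(f"~${cost * 1_000_000:,.0f} per million requests")   # ~$8,750
```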
Self-hosted: Run models on your own infrastructure. Higher upfront cost, potentially lower per-inference cost at scale.
Inference optimization
Quantization: Reduce the numerical precision of model weights (e.g., 16- or 32-bit floats → 8-bit or 4-bit integers). Smaller memory footprint, faster inference, small quality trade-off.
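As an illustration, dynamic INT8 quantization of linear layers with PyTorch, applied to a toy model standing in for a real one; 4-bit schemes generally rely on dedicated libraries:

```python
# Dynamic INT8 quantization: weights stored as 8-bit integers, activations quantized on the fly.
import torch
import torch.nn as nn

# Toy model standing in for a much larger network (illustrative only).
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Rough memory comparison: 32-bit floats vs. 8-bit integers for the weights.
fp32_bytes = sum(p.numel() * 4 for p in model.parameters())
print(f"fp32 weights: ~{fp32_bytes / 1e6:.0f} MB, int8 (approx.): ~{fp32_bytes / 4 / 1e6:.0f} MB")
```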
Pruning: Remove unnecessary weights. Smaller, faster model.
Distillation: Train a smaller model to mimic a larger one. Much faster inference, some capability loss.
Caching: Store and reuse results for common queries. Instant responses for repeated questions.
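A minimal in-memory sketch; `run_model` is a hypothetical stand-in for a real inference call, and a production system would normalize queries and use an external cache:

```python
# In-memory response cache: identical queries skip the model entirely.
cache: dict[str, str] = {}

def run_model(query: str) -> str:
    # Hypothetical stand-in for a real inference call (e.g., an API request).
    return f"model output for: {query}"

def answer(query: str) -> str:
    if query in cache:
        return cache[query]              # cache hit: instant, no inference cost
    response = run_model(query)          # cache miss: pay for one inference
    cache[query] = response
    return response

print(answer("What is inference?"))      # runs the model
print(answer("What is inference?"))      # served from the cache
```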
Batching: Process multiple requests together. Better GPU utilization.
Speculative decoding: Use a small, fast model to draft responses that a larger model verifies. Faster generation.
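Some libraries expose this directly; for example, `transformers`' assisted generation accepts a smaller draft model (the specific models and API details here are assumptions and vary by version):

```python
# Speculative (assisted) decoding: a small draft model proposes tokens, the large model verifies.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2-large")   # the model whose output you want
draft = AutoModelForCausalLM.from_pretrained("gpt2")          # small, fast drafter (same tokenizer)

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```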
Hardware acceleration: GPUs, TPUs, and specialized AI chips (such as the NVIDIA H100) dramatically speed up inference.
Inference latency
Why latency matters: Users expect fast responses. High latency frustrates users and limits applications.
Components of latency:
- Network: Time to send request and receive response
- Queue wait: Time waiting if system is busy
- First token: Time to generate the first output token
- Generation: Time to complete the full response
Time to First Token (TTFT): Critical metric. Users perceive systems as faster when they see output start quickly, even if total time is the same.
Streaming: Return tokens as they're generated rather than waiting for completion. Feels much faster to users.
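A sketch of streaming with `transformers`' `TextStreamer`, which prints tokens as they are generated; hosted APIs expose the same idea through server-sent events:

```python
# Streaming: emit tokens as they are produced instead of waiting for the full response.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Streaming matters because", return_tensors="pt")
streamer = TextStreamer(tokenizer, skip_prompt=True)     # prints each chunk as it arrives
model.generate(**inputs, streamer=streamer, max_new_tokens=40)
```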
Latency targets:
- Interactive chat: < 500ms TTFT, streaming
- API applications: Varies by use case
- Background processing: Latency less critical than throughput
Related Terms
Large Language Model (LLM)
A neural network trained on massive text datasets that can understand and generate human-like language.
Tokens
The basic units that language models use to process text, typically representing parts of words, whole words, or punctuation.
Fine-tuning
The process of further training a pre-trained AI model on a specific dataset to improve its performance on particular tasks.