Inference
The process of using a trained AI model to make predictions or generate outputs from new inputs.
What is inference?
Inference is running a trained AI model to get outputs from new inputs. It's the "using" phase, as opposed to the "learning" phase (training).
Training:
- Feed massive datasets to the model
- Adjust weights to minimize errors
- Computationally expensive
- Done once (or occasionally)
Inference:
- Feed new data to the trained model
- Model produces predictions/outputs
- Much faster than training
- Done every time you use the model
When you chat with ChatGPT, that's inference. Every response is the model running inference on your prompt plus conversation history.
How does inference work?
For a language model, the steps are (sketched in code after this list):
- Tokenize: Convert your text input to tokens
- Embed: Convert tokens to numerical vectors
- Process: Run vectors through model layers
- Generate: Produce output token by token
- Decode: Convert output tokens back to text
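A minimal sketch of these steps in Python, using the Hugging Face `transformers` library and GPT-2 purely as illustrative choices (not part of the definition above):

```python
# Sketch of the inference pipeline with Hugging Face transformers (illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # any causal LM would work here
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# 1-2. Tokenize: text -> token IDs (the embedding step happens inside the model)
inputs = tokenizer("Inference is", return_tensors="pt")

# 3-4. Process and generate: run the layers and produce output tokens
output_ids = model.generate(**inputs, max_new_tokens=20)

# 5. Decode: token IDs -> text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```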
Autoregressive generation: LLMs generate text one token at a time. Each new token is appended to the input and used to generate the next. A 100-word response is roughly 130 tokens in English, so it takes well over a hundred sequential inference steps.
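To make the loop concrete, here is a sketch of greedy autoregressive decoding; the model and library are assumptions for illustration:

```python
# Greedy autoregressive decoding: each new token is appended and fed back in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):                                          # one forward pass per new token
        logits = model(input_ids).logits                         # (batch, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)      # becomes input for the next step

print(tokenizer.decode(input_ids[0]))
```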
Batching: To improve efficiency, serving systems process multiple requests simultaneously in a single forward pass, sharing fixed overhead and improving hardware utilization (as sketched below).
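A sketch of what batching can look like at the library level, assuming the same `transformers` setup as above; real serving systems batch requests dynamically across users:

```python
# Batched inference: several prompts share one forward pass per decoding step.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
tokenizer.padding_side = "left"                  # left-pad so generation continues from real text
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["Inference is", "Training is", "Batching helps because"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**batch, max_new_tokens=15, pad_token_id=tokenizer.eos_token_id)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```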
KV caching: Models cache the attention keys and values computed for earlier tokens, so each new step only computes over the latest token instead of reprocessing the whole sequence.
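A rough sketch of the cache at the API level, using `transformers`' `past_key_values`; production serving stacks manage this internally, so treat the details as illustrative:

```python
# KV caching: reuse attention keys/values so each step only processes the new token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("KV caching avoids", return_tensors="pt").input_ids
generated = []
with torch.no_grad():
    out = model(input_ids, use_cache=True)                 # process the full prompt once
    for _ in range(8):
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        # Only the newest token is fed in; earlier keys/values come from the cache.
        out = model(next_id, past_key_values=out.past_key_values, use_cache=True)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```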
Inference costs and pricing
Why inference costs matter: While training is a one-time cost, inference happens with every use. High-volume applications can have significant inference costs.
Cost factors:
- Model size: Larger models = more computation = higher cost
- Input length: More tokens to process = higher cost
- Output length: More tokens to generate = higher cost
- Speed requirements: Faster inference often costs more
Typical pricing (API-based): Charged per token, with different rates for input and output (a worked cost example follows the list):
- GPT-4o: ~$0.0025/1K input, ~$0.01/1K output tokens
- Claude 3.5 Sonnet: ~$0.003/1K input, ~$0.015/1K output tokens
- Open-source hosted: Often 10-50% cheaper
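A quick back-of-the-envelope calculation using the GPT-4o rates above; prices change frequently, and the token counts here are assumptions:

```python
# Rough cost estimate at ~$0.0025 per 1K input tokens and ~$0.01 per 1K output tokens.
input_tokens, output_tokens = 1_500, 500          # e.g., prompt plus chat history, short reply
cost = input_tokens / 1000 * 0.0025 + output_tokens / 1000 * 0.01
print(f"~${cost:.4f} per request")                         # ~$0.0088
print(f"~${cost * 1_000_000:,.0f} per million requests")   # ~$8,750
```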
Self-hosted: Run models on your own infrastructure. Higher upfront cost, potentially lower per-inference cost at scale.
Inference optimization
Quantization: Reduce the numerical precision of model weights (e.g., 16- or 32-bit floats → 8-bit or 4-bit integers). Smaller memory footprint, faster inference, small quality trade-off.
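As an illustration, dynamic INT8 quantization of linear layers with PyTorch, applied to a toy model standing in for a real one; 4-bit schemes generally rely on dedicated libraries:

```python
# Dynamic INT8 quantization: weights stored as 8-bit integers, activations quantized on the fly.
import torch
import torch.nn as nn

# Toy model standing in for a much larger network (illustrative only).
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Rough memory comparison: 32-bit floats vs. 8-bit integers for the weights.
fp32_bytes = sum(p.numel() * 4 for p in model.parameters())
print(f"fp32 weights: ~{fp32_bytes / 1e6:.0f} MB, int8 (approx.): ~{fp32_bytes / 4 / 1e6:.0f} MB")
```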
Pruning: Remove unnecessary weights. Smaller, faster model.
Distillation: Train a smaller model to mimic a larger one. Much faster inference, some capability loss.
Caching: Store and reuse results for common queries. Instant responses for repeated questions.
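A minimal in-memory sketch; `run_model` is a hypothetical stand-in for a real inference call, and a production system would normalize queries and use an external cache:

```python
# In-memory response cache: identical queries skip the model entirely.
cache: dict[str, str] = {}

def run_model(query: str) -> str:
    # Hypothetical stand-in for a real inference call (e.g., an API request).
    return f"model output for: {query}"

def answer(query: str) -> str:
    if query in cache:
        return cache[query]              # cache hit: instant, no inference cost
    response = run_model(query)          # cache miss: pay for one inference
    cache[query] = response
    return response

print(answer("What is inference?"))      # runs the model
print(answer("What is inference?"))      # served from the cache
```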
Batching: Process multiple requests together. Better GPU utilization.
Speculative decoding: Use a small, fast model to draft responses that a larger model verifies. Faster generation.
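Some libraries expose this directly; for example, `transformers`' assisted generation accepts a smaller draft model (the specific models and API details here are assumptions and vary by version):

```python
# Speculative (assisted) decoding: a small draft model proposes tokens, the large model verifies.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2-large")   # the model whose output you want
draft = AutoModelForCausalLM.from_pretrained("gpt2")          # small, fast drafter (same tokenizer)

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```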
Hardware acceleration: GPUs, TPUs, and specialized AI chips (such as the NVIDIA H100) dramatically speed up inference.
Inference latency
Why latency matters: Users expect fast responses. High latency frustrates users and limits applications.
Components of latency:
- Network: Time to send request and receive response
- Queue wait: Time waiting if system is busy
- First token: Time to generate the first output token
- Generation: Time to complete the full response
Time to First Token (TTFT): Critical metric. Users perceive systems as faster when they see output start quickly, even if total time is the same.
Streaming: Return tokens as they're generated rather than waiting for completion. Feels much faster to users.
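A sketch of streaming with `transformers`' `TextStreamer`, which prints tokens as they are generated; hosted APIs expose the same idea through server-sent events:

```python
# Streaming: emit tokens as they are produced instead of waiting for the full response.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Streaming matters because", return_tensors="pt")
streamer = TextStreamer(tokenizer, skip_prompt=True)     # prints each chunk as it arrives
model.generate(**inputs, streamer=streamer, max_new_tokens=40)
```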
Latency targets:
- Interactive chat: < 500ms TTFT, streaming
- API applications: Varies by use case
- Background processing: Latency less critical than throughput
Related Terms
Large Language Model (LLM)
A neural network trained on massive text datasets that can understand and generate human-like language.
Tokens
The basic units that language models use to process text, typically representing parts of words, whole words, or punctuation.
Fine-tuning
The process of further training a pre-trained AI model on a specific dataset to improve its performance on particular tasks.