LLM
Large language models — architecture, deployment, optimization, and cloud offerings
Overview
Large Language Models (LLMs) are deep neural networks trained on massive text corpora to understand and generate human language. Built on the Transformer architecture, they learn statistical patterns in language and generate text autoregressively — predicting the next token given all previous tokens. The scale of these models (billions of parameters) enables emergent capabilities like reasoning, code generation, translation, and instruction following.
Core Tokens & Context Window
Tokens are the atomic units of text the model processes — roughly 3/4 of a word in English. The context window is the maximum number of tokens the model can process in a single forward pass (e.g., 8K, 128K, 1M+). Everything the model reads and writes must fit within this window.
Core Parameters & Weights
Parameters (or weights) are the learned numerical values in the neural network. A 70B model has 70 billion parameters. More parameters generally means more capacity to store knowledge and perform complex reasoning, but requires proportionally more compute and memory.
Concept Inference vs Training
Training is the process of learning weights from data (months on thousands of GPUs, millions of dollars). Inference is using a trained model to generate text (seconds on a single GPU). Fine-tuning adapts a pre-trained model to a specific task with a smaller dataset — much cheaper than training from scratch.
Concept Autoregressive Generation
LLMs generate text one token at a time. At each step, the model computes a probability distribution over all possible next tokens and samples from it. This means generation speed is sequential — you cannot parallelize generating token 5 until token 4 exists. This is why inference optimization matters enormously.
History GPT & BERT Era (2018–2022)
Google's BERT (2018) demonstrated bidirectional understanding. OpenAI's GPT-2 (2019) showed coherent long-form generation. GPT-3 (2020, 175B params) demonstrated few-shot learning — the model could perform tasks just from examples in the prompt, without fine-tuning.
History Post-ChatGPT Explosion (2022–)
ChatGPT (Nov 2022) brought LLMs to the mainstream. Since then: GPT-4/5, Claude, Gemini, Llama, Mistral, Qwen, DeepSeek, and dozens more. Open-source models caught up rapidly. Instruction tuning and RLHF became standard. Context windows expanded from 4K to 1M+ tokens. Reasoning models (o1, o3, R1) introduced chain-of-thought at inference time.
Architecture & Internals
Modern LLMs are built on the Transformer architecture introduced in the 2017 paper "Attention Is All You Need." Understanding how transformers work is essential for optimizing inference, choosing hardware, and debugging model behavior.
Core transformer components
Attention Self-Attention
The key innovation. Each token attends to every other token in the sequence, computing relevance scores. This lets the model understand relationships regardless of distance — "the cat sat on the mat because it was tired" — the model learns that "it" refers to "cat" through attention weights.
Attention Multi-Head Attention
Instead of one attention computation, the model runs multiple heads in parallel, each learning different relationship patterns (syntax, semantics, coreference, etc.). Outputs are concatenated and projected. Typical: 32–128 heads depending on model size.
Layer Feed-Forward Network
After attention, each token passes through a position-wise feed-forward network (typically two linear layers with a nonlinearity like SiLU/GELU). This is where most of the model's parameters live — the FFN layers store factual knowledge.
Layer Positional Encoding
Transformers have no inherent notion of token order. Positional encodings inject position information. Modern models use RoPE (Rotary Position Embeddings) which encode relative positions and can be extended to longer sequences than seen during training.
Forward pass as a DAG
KV cache
During autoregressive generation, the model recomputes attention over all previous tokens at each step. The KV cache stores the Key and Value matrices from previous tokens so they don't need to be recomputed. This turns generation from O(n²) to O(n) per step, but the cache grows linearly with sequence length and consumes significant GPU memory.
Decoder-only vs encoder-decoder
| Architecture | How it works | Examples |
|---|---|---|
| Decoder-only | Processes input and output as a single sequence with causal masking (each token can only attend to previous tokens). Simpler, scales better. | GPT-4, Claude, Llama, Mistral |
| Encoder-decoder | Encoder processes input bidirectionally, decoder generates output attending to encoder output. Better for structured input→output tasks. | T5, BART, original Transformer |
Mixture of Experts (MoE)
MoE models replace the single FFN in each block with multiple expert FFNs and a router network. For each token, the router selects only 2 of (say) 8 experts. This means a 46.7B-parameter MoE model (Mixtral 8x7B) only activates ~12.9B parameters per token, giving near-7B inference speed with much larger total capacity. DeepSeek V3 takes this further with 671B total parameters but only 37B active per token.
The fundamental bottleneck in LLM inference is memory bandwidth, not compute. Generating each token requires reading the model weights from GPU memory. A 70B FP16 model is 140GB — even an H100 (3.35 TB/s bandwidth) takes ~42ms just to read the weights once per token. This is why quantization (making weights smaller) directly improves throughput.
Open Source Models
The open-source LLM ecosystem has exploded since Meta released Llama 2 in July 2023. Today, open models rival or exceed proprietary ones for many tasks, and can be self-hosted for data privacy, cost control, and low-latency inference. MoE architectures (Llama 4, DeepSeek V3, Qwen 3) have made frontier-quality open models more accessible.
Major model families
Meta Llama 4
Llama 4 (April 2025) introduced MoE architectures: Scout (109B total, 17B active, 16 experts, 10M context) and Maverick (400B total, 17B active, 128 experts, 1M context). Natively multimodal (text + image input). Previous generation Llama 3.1/3.3 (8B, 70B, 405B dense) remain widely deployed. Llama Community License.
Mistral AI Mistral / Mixtral
Mistral Large 3 (675B total, 41B active, MoE, 256K context, multimodal). Ministral 3 (3B, 8B, 14B dense) for edge. Mixtral 8x7B (47B total, 13B active, MoE) remains popular for efficiency. Codestral 25.01 specialized for code. Apache 2.0 license for many models.
Alibaba Qwen 3 / 3.5
Qwen 3 (April 2025): dense (0.6B–32B) and MoE (30B-A3B, 235B-A22B). Qwen 3.5 (Feb 2026): 397B with native multimodal. Excellent multilingual support (especially CJK). Strong coding and math. Apache 2.0 license. Hybrid thinking modes for reasoning.
Google Gemma 3
Sizes: 270M, 1B, 4B, 12B, 27B. Multimodal (image + text) at 4B+. Up to 128K context. Designed for edge and on-device deployment. Built with the same research as Gemini. Good for resource-constrained environments where you need a capable small model.
Microsoft Phi-4
Sizes: 3.8B (Mini), 5.6B (Multimodal), 14B, 15B (Reasoning-Vision). Excels at complex reasoning, math, and coding for its size. Trained on high-quality synthetic data. MIT license. Phi-4-reasoning variants add chain-of-thought capabilities.
DeepSeek DeepSeek V3 / R1
671B MoE (37B active). V3 excels at coding and general tasks; V3.1 and V3.2 (685B) added improved reasoning and agentic capabilities. R1 is a reasoning-focused model using chain-of-thought. Trained at a fraction of the cost of comparable models. Open weights, commercially permissive.
Cohere Command R+
104B parameters. Specifically optimized for RAG (Retrieval-Augmented Generation) with built-in citation generation. Strong tool use and multi-step reasoning. Available via Cohere's API and on cloud marketplaces. CC-BY-NC license for the weights.
VRAM requirements (rule of thumb)
FP16 VRAM ≈ 2× parameter count in GB. Q4 (4-bit quantized) VRAM ≈ 0.5–0.6× parameter count in GB. Add 1–4GB overhead for KV cache depending on context length.
| Model | Parameters | FP16 VRAM | Q4 VRAM | Min GPU |
|---|---|---|---|---|
| Phi-4 Mini | 3.8B | ~7.6 GB | ~2.5 GB | RTX 3060 12GB |
| Mistral 7B | 7B | ~14 GB | ~4 GB | RTX 4070 Ti 16GB |
| Llama 3.1 8B | 8B | ~16 GB | ~5 GB | RTX 3090 24GB |
| Qwen 3 32B | 32B | ~64 GB | ~18 GB | A100 80GB |
| Llama 3.3 70B | 70B | ~140 GB | ~38 GB | 2× A100 80GB |
| Mixtral 8x7B | 47B (13B active) | ~94 GB | ~26 GB | 2× RTX 4090 |
| Llama 4 Maverick | 400B (17B active) | ~800 GB | ~220 GB | 8× A100 80GB |
| DeepSeek V3 | 671B (37B active) | ~1.3 TB | ~370 GB | 8× H100 80GB |
For most production use cases, quantized models are the way to go. A Q4-quantized 70B model running on 2× RTX 4090s often outperforms an FP16 7B model on a single GPU — more knowledge, similar speed, and the quality loss from quantization is minimal for most tasks.
Proprietary Models & Cloud
Proprietary models from major cloud providers offer the highest capability for complex tasks, managed infrastructure, and enterprise compliance. The tradeoff is cost, data privacy considerations, and vendor lock-in.
Azure OpenAI (via Azure)
GPT-4.1 (1M context), GPT-4o (multimodal, fast), o3/o4-mini (reasoning models with chain-of-thought), GPT-5 (frontier). Deployed via Azure OpenAI Service — same API as OpenAI, but data stays in your Azure region. Enterprise compliance (SOC 2, HIPAA eligible), content filtering, rate limiting. Pricing: per-token (input tokens cheaper than output). Azure AI Foundry for prototyping and evaluation.
API Anthropic (Claude)
Claude Opus 4.6 (highest capability), Claude Sonnet 4.6 (balanced), Claude Haiku 3.5 (fast, cheap). API-only — no self-hosted option. Also available on AWS Bedrock and Google Vertex AI. 1M token context window. Strong for long-context analysis, coding, and careful instruction following. Excels at reducing hallucination.
GCP Google (Gemini)
Gemini 3 Flash (fast, default), Gemini 3 Pro (frontier reasoning), Gemini 2.5 Pro (widely deployed). Natively multimodal — text, image, video, and audio in a single model. Via Vertex AI (enterprise) or Google AI Studio (prototyping). Vertex AI Model Garden also hosts third-party models (Llama, Claude, Mistral).
AWS AWS Bedrock
Managed service hosting multiple providers under a unified API: Claude, Llama, Mistral, Cohere, Stability AI, and more. Pay-per-token, no infrastructure management. Supports fine-tuning, knowledge bases (RAG), and agents. Data stays within your AWS account. Good for organizations already on AWS who want model flexibility.
Others Additional Providers
Cohere — enterprise NLP, embeddings (Embed v3), RAG-optimized Command models. AI21 — Jamba (SSM-Transformer hybrid). xAI — Grok models. Perplexity — search-augmented generation, good for factual queries. Groq — ultra-fast inference on custom LPU hardware. Together AI, Fireworks AI, DeepInfra — hosted open-source model APIs with competitive pricing.
Provider comparison
| Provider | Top Model | Context | Strengths | Pricing Model |
|---|---|---|---|---|
| OpenAI / Azure | GPT-4.1, o3, GPT-5 | 1M | Broadest capability, multimodal, reasoning | Per-token (input/output) |
| Anthropic | Claude Opus 4.6 | 1M | Long-context, coding, low hallucination | Per-token (input/output) |
| Gemini 3 Pro | 2M | Native multimodal, huge context, integration | Per-token / per-character | |
| AWS Bedrock | Multi-provider | Varies | Unified API, model flexibility, AWS ecosystem | Per-token (varies by model) |
| Cohere | Command A / R+ | 128K | RAG-optimized, enterprise embeddings | Per-token |
When using proprietary APIs, your prompts and completions leave your infrastructure. For sensitive data, use Azure OpenAI Service (data stays in region, no training on your data), AWS Bedrock (data stays in your VPC), or self-host open models. Always review the provider's data processing agreement and retention policies.
vLLM
vLLM is the leading open-source high-throughput serving engine for LLMs. It dramatically improves inference performance through PagedAttention and continuous batching, making it the go-to choice for production self-hosted deployments.
Key innovations
Core PagedAttention
Manages the KV cache like virtual memory pages. Traditional serving pre-allocates contiguous memory for each request's max sequence length, wasting 60–80% of GPU memory. PagedAttention allocates KV cache in non-contiguous blocks on demand, enabling near-zero memory waste and 2–4× more concurrent requests.
Core Continuous Batching
Traditional batching waits for the longest sequence to finish before starting new requests. Continuous batching dynamically adds and removes requests mid-batch at each generation step. This keeps GPU utilization high and reduces tail latency significantly.
Feature Tensor Parallelism
Splits model layers across multiple GPUs. A 70B model can be served across 2× or 4× GPUs with near-linear throughput scaling. vLLM handles the inter-GPU communication (NCCL) automatically.
Feature OpenAI-Compatible API
vLLM serves an OpenAI-compatible REST API out of the box. Drop-in replacement — change the base URL from api.openai.com to your vLLM server and existing code works. Supports /v1/chat/completions, /v1/completions, and /v1/models.
Starting a vLLM server
# Install vLLM
pip install vllm
# Start OpenAI-compatible server with tensor parallelism
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
# With quantized model (AWQ)
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
--quantization awq \
--tensor-parallel-size 4
# Query the server (same as OpenAI API)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Explain PagedAttention"}],
"max_tokens": 512,
"temperature": 0.7
}'
Serving frameworks comparison
| Framework | Developer | Key Feature | Best For |
|---|---|---|---|
| vLLM | UC Berkeley / community | PagedAttention, continuous batching | High-throughput production serving |
| TGI | Hugging Face | Flash Attention, token streaming | HuggingFace ecosystem integration |
| TensorRT-LLM | NVIDIA | Kernel-level optimization, FP8 | Maximum single-request latency on NVIDIA GPUs |
| llama.cpp | ggerganov | CPU inference, GGUF quantization | Local/edge deployment, CPU+GPU hybrid |
| SGLang | LMSYS | RadixAttention, structured generation | Constrained decoding, JSON output |
For most production deployments, start with vLLM. It has the best balance of throughput, ease of use, and model support. Switch to TensorRT-LLM only if you need absolute minimum latency and are willing to deal with the build/compilation complexity. Use llama.cpp for local development and CPU-only environments.
Ollama & Open WebUI
Ollama makes running LLMs locally as simple as running Docker containers. Open WebUI provides a ChatGPT-like web interface on top. Together, they form the fastest path to running models on your own hardware.
Ollama
Ollama wraps llama.cpp in a user-friendly CLI with a model registry, automatic GGUF quantization handling, and a REST API. It manages model downloads, GPU detection, and memory allocation automatically.
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Run a model (downloads automatically on first use)
ollama run llama3.1
ollama run mistral
ollama run codellama:34b
# Pull a model without running it
ollama pull qwen2.5:72b
# List downloaded models
ollama list
# Show model details (parameters, quantization, size)
ollama show llama3.1
# REST API (default: localhost:11434)
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1", "prompt": "What is PagedAttention?"}'
# Chat API (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello"}]
}'
Modelfile (custom models)
# Modelfile — like a Dockerfile for LLMs
FROM llama3.1
# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
# System prompt
SYSTEM """You are a senior DevOps engineer. Answer questions about
infrastructure, CI/CD, and cloud architecture. Be concise and
provide code examples when relevant."""
# Build and run
# ollama create devops-assistant -f Modelfile
# ollama run devops-assistant
Open WebUI
Open WebUI is a self-hosted ChatGPT-style interface that connects to Ollama (or any OpenAI-compatible API). Features include multi-model chat, RAG with document upload, conversation history, user management, and model management.
# Deploy Open WebUI with Docker (connects to Ollama on host)
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data --name open-webui \
ghcr.io/open-webui/open-webui:main
# With GPU support and bundled Ollama
docker run -d -p 3000:8080 --gpus all \
-v ollama:/root/.ollama -v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:ollama
# Access at http://localhost:3000
# First user to sign up becomes admin
Feature Open WebUI Highlights
- Multi-model conversations (switch models mid-chat)
- RAG: upload PDFs, docs — model answers from your documents
- Conversation branching and regeneration
- User management with role-based access
- Model management (pull/delete from UI)
- Custom prompts library
- Web search integration
Alt LM Studio
LM Studio is a desktop application (macOS, Windows, Linux) for running LLMs locally. GUI-based model discovery and download from HuggingFace. Built-in chat interface and OpenAI-compatible local server. Good for non-technical users or quick experimentation. Supports GGUF models with automatic GPU offloading.
Optimization & Quantization
Running large models on limited hardware requires aggressive optimization. Quantization is the most impactful technique — reducing numerical precision to shrink model size and increase throughput with minimal quality loss.
Quantization methods
Format GGUF (llama.cpp)
The standard format for CPU+GPU hybrid inference. Supports Q2 through Q8 quantization levels. Models can be partially offloaded to GPU. Most flexible for consumer hardware. Used by Ollama, LM Studio, and llama.cpp directly.
Format GPTQ
GPU-focused post-training quantization. Uses calibration data to minimize quantization error. 4-bit and 8-bit variants. Fast inference on NVIDIA GPUs via auto-gptq or exllama. Slightly better quality than naive round-to-nearest quantization.
Format AWQ (Activation-Aware)
Identifies the 1% of weights that matter most (based on activation magnitudes) and preserves them at higher precision. Better quality than GPTQ at the same bit width. Supported natively by vLLM. The recommended quantization for production GPU serving.
Format bitsandbytes
On-the-fly quantization during model loading. No pre-quantized model needed. load_in_4bit or load_in_8bit in HuggingFace Transformers. Convenient for experimentation but slower than pre-quantized formats. Enables QLoRA fine-tuning.
Precision comparison
| Precision | Bits/Param | VRAM (7B) | Quality Impact | Use Case |
|---|---|---|---|---|
| FP32 | 32 | ~28 GB | Baseline (training) | Training only |
| FP16 / BF16 | 16 | ~14 GB | Negligible vs FP32 | Default inference |
| INT8 | 8 | ~7 GB | Minimal (<1% quality loss) | Production serving |
| INT4 / Q4 | 4 | ~4 GB | Small (1–3% quality loss) | Consumer GPUs, edge |
| Q2 / Q3 | 2–3 | ~2–3 GB | Noticeable degradation | Extreme constraints only |
Attention optimizations
Memory Flash Attention
Rewrites the attention computation to be IO-aware. Instead of materializing the full N×N attention matrix in GPU HBM, it computes attention in tiles that fit in SRAM. Reduces memory from O(n²) to O(n) and is 2–4× faster. Now standard in all major frameworks (Flash Attention 2/3).
Memory GQA & MQA
GQA (Grouped Query Attention): multiple query heads share a single KV head group, reducing KV cache size by 4–8×. Used by Llama 2/3/4, Mistral, Qwen. MQA (Multi-Query Attention): all query heads share one KV head. Even smaller cache but slightly lower quality. Used by Falcon, PaLM.
Advanced techniques
Speed Speculative Decoding
Use a small draft model (e.g., 1B) to predict 4–8 tokens ahead. The large model verifies all draft tokens in a single forward pass (parallel). If the draft is correct, you get multiple tokens for the cost of one large-model call. Typically 2–3× speedup for greedy decoding.
Size Pruning & Distillation
Pruning: remove weights that contribute little (structured or unstructured sparsity). Distillation: train a smaller "student" model to mimic a larger "teacher." Combined: distill a 70B model into a 7B model that retains 80–90% of the teacher's quality on your specific task.
Which quantization for your hardware?
16–24 GB Consumer GPU
RTX 3090/4090 (24GB) or RTX 5090 (32GB). Use Q4 GGUF via Ollama for up to 13B models. Use AWQ 4-bit via vLLM for maximum throughput. Can fit a Q4 34B model with careful memory management.
48–80 GB Single Data Center GPU
A100 40/80GB or H100 80GB. Run 7–13B models at FP16, or 70B at Q4/Q8. AWQ quantization via vLLM for production serving. FP8 on H100 for near-FP16 quality at half the memory.
Multi-GPU 2–8 GPUs
Run 70B+ models at FP16 with tensor parallelism. 2× A100 80GB = 160GB for a 70B FP16 model. 8× H100 = 640GB for a 405B model at Q4. Use vLLM or TensorRT-LLM for multi-GPU coordination.
CPU Only No GPU
Use llama.cpp / Ollama with Q4 GGUF models. Expect 2–10 tokens/sec for 7B models depending on CPU cores and RAM speed. Viable for low-traffic APIs and development. Needs 8–16GB RAM for 7B Q4.
Small Models & Edge
Not every problem needs GPT-4. Small models (under 4B parameters) can run on phones, laptops, Raspberry Pis, and embedded systems. For narrow, well-defined tasks, a fine-tuned small model often beats a general-purpose large model — at a fraction of the cost and latency.
Notable small models
3.8B Phi-4 Mini
Microsoft's small powerhouse. Runs on phones and laptops. Strong reasoning for its size, trained on high-quality synthetic data. 128K context window. Available in ONNX format for optimized mobile inference.
1B–4B Gemma 3
Google's edge-optimized models. Gemma 3 1B (text-only, 32K context) and 4B (multimodal, 128K context). Small enough for IoT devices with GPU acceleration. Good for classification, extraction, and simple generation tasks. Competitive with much larger models on narrow tasks after fine-tuning.
1.1B TinyLlama
Trained on 3T tokens (same as Llama 2 7B's training data volume). Surprisingly capable for 1.1B parameters. Good for embedded systems, rapid prototyping, and as a draft model for speculative decoding.
0.6B Qwen 3 0.6B
Ultra-lightweight, runs on almost anything. Good for simple classification, entity extraction, and template-based generation. Useful when you need a model that responds in single-digit milliseconds.
135M–1.7B SmolLM2 (HuggingFace)
HuggingFace's family of tiny models: 135M, 360M, and 1.7B. The 1.7B variant is trained on 11T tokens. The 135M model is small enough to run in a browser via WebAssembly. Good for research, education, and extremely constrained environments.
Edge use cases
- On-device assistants — Siri/Google-style assistants that run locally, preserving privacy and working offline
- Code completion in IDEs — sub-100ms completions using small code models (e.g., StarCoder2 3B, Codestral Mini)
- Smart home / IoT — natural language control of devices without cloud dependency
- Offline translation — small translation models for fieldwork, travel, or air-gapped environments
- Classification & extraction — sentiment analysis, named entity recognition, intent detection — fine-tuned small models match or beat large models
- Structured data extraction — parse invoices, receipts, medical records into JSON locally
When small beats large
Win Narrow Domain
A LoRA fine-tuned 3B model on your specific domain data (legal docs, medical records, your codebase) often outperforms GPT-4 on that domain — because it's seen thousands of your examples vs zero.
Win Speed & Cost
A 1B model generates at 100+ tokens/sec on a consumer GPU vs 30 tokens/sec for a 70B model. At scale, this means 100× lower compute cost per token. For high-throughput classification APIs, small models are the only viable option.
Win Privacy & Offline
When data cannot leave the device (healthcare, legal, government), small on-device models are the only option. No API calls, no data in transit, no third-party access. Full compliance with data sovereignty requirements.
Win Latency-Critical
Real-time applications (autocomplete, voice assistants, game NPCs) need sub-50ms first-token latency. Only small, locally-running models can achieve this consistently without network roundtrips.
Start with the smallest model that meets your quality threshold. Fine-tune with LoRA/QLoRA on your specific task data. Evaluate rigorously. Only move to larger models if the small model truly cannot meet your quality bar. Many teams jump to frontier APIs by default when a fine-tuned Phi-4 or Gemma 3 would have worked at 1/100th the cost.
Tools & Augmentation
LLMs alone are limited by their training data cutoff and inability to take actions. Tools and augmentation techniques connect models to real-time data, external systems, and multi-step workflows.
Function / tool calling
Models can output structured JSON to invoke functions you define. The model decides when to call a tool based on the user's request. Supported by OpenAI, Claude, Gemini, and most open models with instruction tuning.
# OpenAI function calling example
import openai
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["city"]
}
}
}]
response = openai.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What's the weather in Toronto?"}],
tools=tools,
tool_choice="auto"
)
# Model responds with: tool_calls[0].function.name = "get_weather"
# tool_calls[0].function.arguments = '{"city": "Toronto", "unit": "celsius"}'
RAG (Retrieval-Augmented Generation)
RAG grounds model responses in your own data. The pattern: embed documents into vectors, store in a vector database, retrieve relevant chunks at query time, and include them in the prompt.
DB Vector Databases
- pgvector — PostgreSQL extension, simplest if you already run Postgres
- Chroma — lightweight, embedded, good for prototyping
- Pinecone — fully managed SaaS, scales to billions of vectors
- Weaviate — open-source, hybrid search (vector + keyword)
- Milvus — open-source, high performance, distributed
Tips RAG Best Practices
- Chunk documents into 256–512 token segments with overlap
- Use a strong embedding model (Cohere Embed v3, OpenAI text-embedding-3-large)
- Retrieve 5–10 chunks, rerank with a cross-encoder before passing to LLM
- Include metadata (source, page, date) so the model can cite sources
- Test with both relevant and adversarial queries
MCP (Model Context Protocol)
The Model Context Protocol is Anthropic's open standard for connecting LLMs to external tools and data sources. Think of it as USB-C for AI — a standardized interface that any model can use to connect to any tool.
Architecture MCP Components
- Host — the application (Claude Desktop, IDE, your app)
- Client — maintains 1:1 connection with a server
- Server — exposes tools, resources, and prompts
Transport: stdio (local processes) or Streamable HTTP (bidirectional via a single endpoint for remote servers; replaces the deprecated SSE transport).
vs MCP vs Function Calling
- Function calling: stateless, per-request tool definitions
- MCP: persistent connections, dynamic tool discovery
- MCP servers can expose resources (read data) and prompts (reusable templates) in addition to tools
- MCP enables a server ecosystem — install once, use across all MCP clients
// Example MCP server tool definition
{
"name": "query_database",
"description": "Execute a read-only SQL query against the production database",
"inputSchema": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "SQL SELECT query to execute"
}
},
"required": ["query"]
}
}
Agents
Agents are LLMs that plan and execute multi-step tasks using tools. The model operates in a ReAct loop: Reason about what to do, Act by calling a tool, Observe the result, and repeat until the task is complete.
# Simplified ReAct agent loop
while not task_complete:
# 1. Reason: LLM decides what to do next
response = llm.generate(
system="You are an agent. Use tools to accomplish the task.",
messages=conversation_history
)
# 2. Act: Execute tool calls from the response
if response.tool_calls:
for call in response.tool_calls:
result = execute_tool(call.name, call.arguments)
conversation_history.append(tool_result(call.id, result))
# 3. Observe: LLM sees the result and decides next step
else:
task_complete = True # LLM responded with text, not a tool call
Popular frameworks: LangGraph (LangChain's agent framework, graph-based workflows), CrewAI (multi-agent collaboration), AutoGen (Microsoft, multi-agent conversations), Claude Code (Anthropic's coding agent). For simple tool-calling patterns, you often don't need a framework — a while loop with the model's tool calling API is sufficient.
Infrastructure & GPUs
Choosing the right hardware for LLM inference is critical. The key constraint is GPU memory (VRAM) — the entire model (or its quantized version) plus the KV cache must fit in GPU memory for efficient inference.
GPU comparison
| GPU | VRAM | Memory BW | FP16 TFLOPS | Price Range | Best For |
|---|---|---|---|---|---|
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | 71 | ~$800 used | Budget inference, dev |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 165 | ~$1,800 | Best consumer GPU |
| RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | 209 | ~$2,000 | Consumer, more VRAM |
| A100 (80GB) | 80 GB HBM2e | 2,039 GB/s | 312 | ~$1.50/hr cloud | Production inference |
| H100 (SXM) | 80 GB HBM3 | 3,350 GB/s | 990 | ~$2.50/hr cloud | High-throughput serving |
| H200 | 141 GB HBM3e | 4,800 GB/s | 990 | ~$3.50/hr cloud | Large models, long context |
| B200 | 192 GB HBM3e | 8,000 GB/s | 2,250 | ~$5/hr cloud | Frontier model serving |
Multi-GPU strategies
Strategy Tensor Parallelism
Split each layer's weight matrices across GPUs. Each GPU computes part of every layer, then they synchronize. Best for inference latency — all GPUs work on every token. Requires fast interconnect (NVLink). Use when model doesn't fit on one GPU.
Strategy Pipeline Parallelism
Assign different layers to different GPUs (GPU 0 = layers 0–19, GPU 1 = layers 20–39). Tokens flow through GPUs sequentially. Lower interconnect requirements but introduces pipeline bubbles. Better for training than inference.
Strategy Data Parallelism
Replicate the full model on each GPU, process different requests on each replica. Best for throughput when the model fits on a single GPU. Simple to set up — just run multiple vLLM instances behind a load balancer.
Strategy Expert Parallelism (MoE)
For Mixture of Experts models, distribute different experts across GPUs. The router sends each token to the GPU(s) hosting its assigned experts. Efficient because each token only needs 2 of N experts, so inter-GPU communication is limited.
Non-GPU options
CPU CPU Inference
Via llama.cpp / Ollama. Practical for small quantized models (7B Q4 at 2–10 tok/s). Leverages AVX-512 or ARM NEON. Good for low-traffic APIs, dev environments, and offline batch processing. Needs fast RAM — DDR5 helps significantly.
Apple Apple Silicon
M1–M4 chips have unified memory — the system RAM is the GPU VRAM. An M4 Max with 128GB RAM can run a 70B model at FP16. Metal acceleration via llama.cpp/Ollama. The best laptop option for running large models locally.
Cloud GPU providers
| Provider | GPU Options | Pricing Model | Notes |
|---|---|---|---|
| AWS (p5/p4 instances) | H100, A100 | On-demand, spot, reserved | Broad ecosystem, SageMaker integration |
| GCP (A3/A2 instances) | H100, A100 | On-demand, spot, committed | TPUs also available, Vertex AI |
| Lambda Labs | H100, A100, A10G | On-demand hourly | Simple, competitive pricing |
| RunPod | H100, A100, RTX 4090 | On-demand, spot | Serverless GPU option, community templates |
| vast.ai | Mixed consumer/datacenter | Marketplace (bidding) | Cheapest, least reliable, peer-to-peer |
Security & Governance
LLMs introduce novel security risks that traditional application security doesn't cover. Prompt injection, data leakage, hallucination, and model poisoning require new mitigation strategies.
Prompt injection
Attack Direct Injection
User crafts input that overrides the system prompt: "Ignore previous instructions and instead...". Mitigations: input validation, system prompt reinforcement, output filtering. No perfect defense exists — defense in depth is required.
Attack Indirect Injection
Malicious instructions embedded in data the model retrieves (web pages, documents, emails). When the model processes this data via RAG or web browsing, it follows the injected instructions. Harder to detect because the attack is in the data, not the user's input.
Data leakage
- PII in prompts — users may paste sensitive data (SSNs, credentials, medical records) into prompts sent to third-party APIs. Implement PII detection and redaction before API calls.
- Secrets in code — developers using AI coding assistants may inadvertently share API keys, connection strings, or internal architecture details. Use secret scanning on prompts.
- Training data extraction — adversarial prompts can sometimes extract memorized training data. Mitigations: differential privacy in training, output monitoring.
Guardrails & output filtering
Tool NVIDIA NeMo Guardrails
Open-source framework for adding programmable guardrails to LLM applications. Define conversation flows, topic boundaries, and safety rails in a simple Colang language. Intercepts both input and output.
Tool Llama Guard
Meta's safety classifier model. Runs as a separate model that classifies inputs and outputs as safe/unsafe across categories (violence, self-harm, illegal activity, etc.). Use as a pre/post-filter around your main LLM.
Compliance & governance
| Concern | Risk | Mitigation |
|---|---|---|
| Data residency | Prompts sent to US-based APIs may violate EU data residency | Use regional endpoints (Azure OpenAI), self-host, or on-premise models |
| GDPR (right to erasure) | User data in training data cannot be selectively removed | Don't fine-tune on user data without consent; use RAG instead (deletable) |
| Audit logging | No record of what the model was asked or answered | Log all prompts, completions, token usage, and model versions |
| Cost controls | Runaway API costs from bugs or abuse | Per-user rate limits, daily spend caps, budget alerts |
| API key management | Leaked keys = unauthorized usage and billing | Rotate regularly, use short-lived tokens, scope permissions |
Hallucination mitigation
- Grounding — use RAG to provide factual context; instruct the model to only answer from provided sources
- Citations — require the model to cite specific passages from retrieved documents
- Confidence thresholds — use logprobs (token probabilities) to detect low-confidence generations
- Structured output — constrain output to JSON schemas, reducing free-form hallucination
- Multi-model verification — use a second model to fact-check the first model's output
Never trust LLM output for safety-critical decisions without human review or automated verification. LLMs can generate confident, well-formatted text that is factually wrong. Always validate against authoritative sources, especially for medical, legal, and financial applications.
Production Checklist
- Choose the right model size — start with the smallest model that meets your quality threshold. Benchmark on your specific task before committing. A fine-tuned 8B model may outperform a generic 70B model on your domain.
- Size your hardware — calculate VRAM needs: model weights + KV cache + overhead. Plan for peak concurrent requests, not average. Include 10–20% memory headroom for stability.
- Pick a quantization strategy — use AWQ for GPU serving (vLLM), GGUF for CPU+GPU hybrid (Ollama). Benchmark quality on your eval set before and after quantization. Q4 is the sweet spot for most deployments.
- Deploy with a production serving framework — use vLLM or TensorRT-LLM, not raw HuggingFace Transformers. Enable continuous batching and PagedAttention for throughput. Set
max-model-lento limit KV cache growth. - Implement health checks and monitoring — monitor GPU utilization, VRAM usage, request latency (P50/P95/P99), throughput (tokens/sec), queue depth, and error rates. Alert on GPU memory > 90%, latency spikes, and OOM kills.
- Set up input/output guardrails — filter PII from inputs before they reach the model. Validate outputs against schemas. Use safety classifiers (Llama Guard) for user-facing applications. Log everything.
- Implement rate limiting and cost controls — per-user, per-API-key, and global rate limits. Set daily/monthly spend caps for API-based models. Monitor token consumption and alert on anomalies.
- Plan for model updates — pin exact model versions (not "latest"). Test new model versions against your eval suite before deployment. Use blue-green or canary deployment patterns for model swaps.
- Secure API keys and endpoints — use short-lived tokens or API gateways. Never expose model serving endpoints directly to the internet. mTLS between services. Rotate keys regularly.
- Build an evaluation pipeline — automated tests with golden datasets for accuracy, hallucination rate, latency, and cost. Run evals on every model change, prompt change, and RAG pipeline change.
- Configure autoscaling — scale GPU replicas based on request queue depth, not CPU utilization. Use data parallelism (multiple model replicas) for throughput scaling. Pre-warm replicas to avoid cold start latency.
- Document your architecture — record model choice rationale, quantization method, hardware specs, prompt templates, RAG pipeline details, and fallback strategies. Future you will thank present you.