LLM Production Guide

01

Overview

Large Language Models (LLMs) are deep neural networks trained on massive text corpora to understand and generate human language. Built on the Transformer architecture, they learn statistical patterns in language and generate text autoregressively — predicting the next token given all previous tokens. The scale of these models (billions of parameters) enables emergent capabilities like reasoning, code generation, translation, and instruction following.

Core Tokens & Context Window

Tokens are the atomic units of text the model processes — roughly 3/4 of a word in English. The context window is the maximum number of tokens the model can process in a single forward pass (e.g., 8K, 128K, 1M+). Everything the model reads and writes must fit within this window.

Core Parameters & Weights

Parameters (or weights) are the learned numerical values in the neural network. A 70B model has 70 billion parameters. More parameters generally means more capacity to store knowledge and perform complex reasoning, but requires proportionally more compute and memory.

Concept Inference vs Training

Training is the process of learning weights from data (months on thousands of GPUs, millions of dollars). Inference is using a trained model to generate text (seconds on a single GPU). Fine-tuning adapts a pre-trained model to a specific task with a smaller dataset — much cheaper than training from scratch.

Concept Autoregressive Generation

LLMs generate text one token at a time. At each step, the model computes a probability distribution over all possible next tokens and samples from it. This means generation speed is sequential — you cannot parallelize generating token 5 until token 4 exists. This is why inference optimization matters enormously.

History GPT & BERT Era (2018–2022)

Google's BERT (2018) demonstrated bidirectional understanding. OpenAI's GPT-2 (2019) showed coherent long-form generation. GPT-3 (2020, 175B params) demonstrated few-shot learning — the model could perform tasks just from examples in the prompt, without fine-tuning.

History Post-ChatGPT Explosion (2022–)

ChatGPT (Nov 2022) brought LLMs to the mainstream. Since then: GPT-4/5, Claude, Gemini, Llama, Mistral, Qwen, DeepSeek, and dozens more. Open-source models caught up rapidly. Instruction tuning and RLHF became standard. Context windows expanded from 4K to 1M+ tokens. Reasoning models (o1, o3, R1) introduced chain-of-thought at inference time.

02

Architecture & Internals

Modern LLMs are built on the Transformer architecture introduced in the 2017 paper "Attention Is All You Need." Understanding how transformers work is essential for optimizing inference, choosing hardware, and debugging model behavior.

Core transformer components

Attention Self-Attention

The key innovation. Each token attends to every other token in the sequence, computing relevance scores. This lets the model understand relationships regardless of distance — "the cat sat on the mat because it was tired" — the model learns that "it" refers to "cat" through attention weights.

Attention Multi-Head Attention

Instead of one attention computation, the model runs multiple heads in parallel, each learning different relationship patterns (syntax, semantics, coreference, etc.). Outputs are concatenated and projected. Typical: 32–128 heads depending on model size.

Layer Feed-Forward Network

After attention, each token passes through a position-wise feed-forward network (typically two linear layers with a nonlinearity like SiLU/GELU). This is where most of the model's parameters live — the FFN layers store factual knowledge.

Layer Positional Encoding

Transformers have no inherent notion of token order. Positional encodings inject position information. Modern models use RoPE (Rotary Position Embeddings) which encode relative positions and can be extended to longer sequences than seen during training.

Forward pass as a DAG

KV cache

During autoregressive generation, the model recomputes attention over all previous tokens at each step. The KV cache stores the Key and Value matrices from previous tokens so they don't need to be recomputed. This turns generation from O(n²) to O(n) per step, but the cache grows linearly with sequence length and consumes significant GPU memory.

Decoder-only vs encoder-decoder

Architecture	How it works	Examples
Decoder-only	Processes input and output as a single sequence with causal masking (each token can only attend to previous tokens). Simpler, scales better.	GPT-4, Claude, Llama, Mistral
Encoder-decoder	Encoder processes input bidirectionally, decoder generates output attending to encoder output. Better for structured input→output tasks.	T5, BART, original Transformer

Mixture of Experts (MoE)

MoE models replace the single FFN in each block with multiple expert FFNs and a router network. For each token, the router selects only 2 of (say) 8 experts. This means a 46.7B-parameter MoE model (Mixtral 8x7B) only activates ~12.9B parameters per token, giving near-7B inference speed with much larger total capacity. DeepSeek V3 takes this further with 671B total parameters but only 37B active per token.

Key concept

The fundamental bottleneck in LLM inference is memory bandwidth, not compute. Generating each token requires reading the model weights from GPU memory. A 70B FP16 model is 140GB — even an H100 (3.35 TB/s bandwidth) takes ~42ms just to read the weights once per token. This is why quantization (making weights smaller) directly improves throughput.

03

Open Source Models

The open-source LLM ecosystem has exploded since Meta released Llama 2 in July 2023. Today, open models rival or exceed proprietary ones for many tasks, and can be self-hosted for data privacy, cost control, and low-latency inference. MoE architectures (Llama 4, DeepSeek V3, Qwen 3) have made frontier-quality open models more accessible.

Major model families

Meta Llama 4

Llama 4 (April 2025) introduced MoE architectures: Scout (109B total, 17B active, 16 experts, 10M context) and Maverick (400B total, 17B active, 128 experts, 1M context). Natively multimodal (text + image input). Previous generation Llama 3.1/3.3 (8B, 70B, 405B dense) remain widely deployed. Llama Community License.

Mistral AI Mistral / Mixtral

Mistral Large 3 (675B total, 41B active, MoE, 256K context, multimodal). Ministral 3 (3B, 8B, 14B dense) for edge. Mixtral 8x7B (47B total, 13B active, MoE) remains popular for efficiency. Codestral 25.01 specialized for code. Apache 2.0 license for many models.

Alibaba Qwen 3 / 3.5

Qwen 3 (April 2025): dense (0.6B–32B) and MoE (30B-A3B, 235B-A22B). Qwen 3.5 (Feb 2026): 397B with native multimodal. Excellent multilingual support (especially CJK). Strong coding and math. Apache 2.0 license. Hybrid thinking modes for reasoning.

Google Gemma 3

Sizes: 270M, 1B, 4B, 12B, 27B. Multimodal (image + text) at 4B+. Up to 128K context. Designed for edge and on-device deployment. Built with the same research as Gemini. Good for resource-constrained environments where you need a capable small model.

Microsoft Phi-4

Sizes: 3.8B (Mini), 5.6B (Multimodal), 14B, 15B (Reasoning-Vision). Excels at complex reasoning, math, and coding for its size. Trained on high-quality synthetic data. MIT license. Phi-4-reasoning variants add chain-of-thought capabilities.

DeepSeek DeepSeek V3 / R1

671B MoE (37B active). V3 excels at coding and general tasks; V3.1 and V3.2 (685B) added improved reasoning and agentic capabilities. R1 is a reasoning-focused model using chain-of-thought. Trained at a fraction of the cost of comparable models. Open weights, commercially permissive.

Cohere Command R+

104B parameters. Specifically optimized for RAG (Retrieval-Augmented Generation) with built-in citation generation. Strong tool use and multi-step reasoning. Available via Cohere's API and on cloud marketplaces. CC-BY-NC license for the weights.

VRAM requirements (rule of thumb)

FP16 VRAM ≈ 2× parameter count in GB. Q4 (4-bit quantized) VRAM ≈ 0.5–0.6× parameter count in GB. Add 1–4GB overhead for KV cache depending on context length.

Model	Parameters	FP16 VRAM	Q4 VRAM	Min GPU
Phi-4 Mini	3.8B	~7.6 GB	~2.5 GB	RTX 3060 12GB
Mistral 7B	7B	~14 GB	~4 GB	RTX 4070 Ti 16GB
Llama 3.1 8B	8B	~16 GB	~5 GB	RTX 3090 24GB
Qwen 3 32B	32B	~64 GB	~18 GB	A100 80GB
Llama 3.3 70B	70B	~140 GB	~38 GB	2× A100 80GB
Mixtral 8x7B	47B (13B active)	~94 GB	~26 GB	2× RTX 4090
Llama 4 Maverick	400B (17B active)	~800 GB	~220 GB	8× A100 80GB
DeepSeek V3	671B (37B active)	~1.3 TB	~370 GB	8× H100 80GB

Practical tip

For most production use cases, quantized models are the way to go. A Q4-quantized 70B model running on 2× RTX 4090s often outperforms an FP16 7B model on a single GPU — more knowledge, similar speed, and the quality loss from quantization is minimal for most tasks.

04

Proprietary Models & Cloud

Proprietary models from major cloud providers offer the highest capability for complex tasks, managed infrastructure, and enterprise compliance. The tradeoff is cost, data privacy considerations, and vendor lock-in.

Azure OpenAI (via Azure)

GPT-4.1 (1M context), GPT-4o (multimodal, fast), o3/o4-mini (reasoning models with chain-of-thought), GPT-5 (frontier). Deployed via Azure OpenAI Service — same API as OpenAI, but data stays in your Azure region. Enterprise compliance (SOC 2, HIPAA eligible), content filtering, rate limiting. Pricing: per-token (input tokens cheaper than output). Azure AI Foundry for prototyping and evaluation.

API Anthropic (Claude)

Claude Opus 4.6 (highest capability), Claude Sonnet 4.6 (balanced), Claude Haiku 3.5 (fast, cheap). API-only — no self-hosted option. Also available on AWS Bedrock and Google Vertex AI. 1M token context window. Strong for long-context analysis, coding, and careful instruction following. Excels at reducing hallucination.

GCP Google (Gemini)

Gemini 3 Flash (fast, default), Gemini 3 Pro (frontier reasoning), Gemini 2.5 Pro (widely deployed). Natively multimodal — text, image, video, and audio in a single model. Via Vertex AI (enterprise) or Google AI Studio (prototyping). Vertex AI Model Garden also hosts third-party models (Llama, Claude, Mistral).

AWS AWS Bedrock

Managed service hosting multiple providers under a unified API: Claude, Llama, Mistral, Cohere, Stability AI, and more. Pay-per-token, no infrastructure management. Supports fine-tuning, knowledge bases (RAG), and agents. Data stays within your AWS account. Good for organizations already on AWS who want model flexibility.

Others Additional Providers

Cohere — enterprise NLP, embeddings (Embed v3), RAG-optimized Command models. AI21 — Jamba (SSM-Transformer hybrid). xAI — Grok models. Perplexity — search-augmented generation, good for factual queries. Groq — ultra-fast inference on custom LPU hardware. Together AI, Fireworks AI, DeepInfra — hosted open-source model APIs with competitive pricing.

Provider comparison

Provider	Top Model	Context	Strengths	Pricing Model
OpenAI / Azure	GPT-4.1, o3, GPT-5	1M	Broadest capability, multimodal, reasoning	Per-token (input/output)
Anthropic	Claude Opus 4.6	1M	Long-context, coding, low hallucination	Per-token (input/output)
Google	Gemini 3 Pro	2M	Native multimodal, huge context, integration	Per-token / per-character
AWS Bedrock	Multi-provider	Varies	Unified API, model flexibility, AWS ecosystem	Per-token (varies by model)
Cohere	Command A / R+	128K	RAG-optimized, enterprise embeddings	Per-token

Data residency

When using proprietary APIs, your prompts and completions leave your infrastructure. For sensitive data, use Azure OpenAI Service (data stays in region, no training on your data), AWS Bedrock (data stays in your VPC), or self-host open models. Always review the provider's data processing agreement and retention policies.

05

vLLM

vLLM is the leading open-source high-throughput serving engine for LLMs. It dramatically improves inference performance through PagedAttention and continuous batching, making it the go-to choice for production self-hosted deployments.

Key innovations

Core PagedAttention

Manages the KV cache like virtual memory pages. Traditional serving pre-allocates contiguous memory for each request's max sequence length, wasting 60–80% of GPU memory. PagedAttention allocates KV cache in non-contiguous blocks on demand, enabling near-zero memory waste and 2–4× more concurrent requests.

Core Continuous Batching

Traditional batching waits for the longest sequence to finish before starting new requests. Continuous batching dynamically adds and removes requests mid-batch at each generation step. This keeps GPU utilization high and reduces tail latency significantly.

Feature Tensor Parallelism

Splits model layers across multiple GPUs. A 70B model can be served across 2× or 4× GPUs with near-linear throughput scaling. vLLM handles the inter-GPU communication (NCCL) automatically.

Feature OpenAI-Compatible API

vLLM serves an OpenAI-compatible REST API out of the box. Drop-in replacement — change the base URL from api.openai.com to your vLLM server and existing code works. Supports /v1/chat/completions, /v1/completions, and /v1/models.

Starting a vLLM server

# Install vLLM
pip install vllm

# Start OpenAI-compatible server with tensor parallelism
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

# With quantized model (AWQ)
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
  --quantization awq \
  --tensor-parallel-size 4

# Query the server (same as OpenAI API)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain PagedAttention"}],
    "max_tokens": 512,
    "temperature": 0.7
  }'

Serving frameworks comparison

Framework	Developer	Key Feature	Best For
vLLM	UC Berkeley / community	PagedAttention, continuous batching	High-throughput production serving
TGI	Hugging Face	Flash Attention, token streaming	HuggingFace ecosystem integration
TensorRT-LLM	NVIDIA	Kernel-level optimization, FP8	Maximum single-request latency on NVIDIA GPUs
llama.cpp	ggerganov	CPU inference, GGUF quantization	Local/edge deployment, CPU+GPU hybrid
SGLang	LMSYS	RadixAttention, structured generation	Constrained decoding, JSON output

Recommendation

For most production deployments, start with vLLM. It has the best balance of throughput, ease of use, and model support. Switch to TensorRT-LLM only if you need absolute minimum latency and are willing to deal with the build/compilation complexity. Use llama.cpp for local development and CPU-only environments.

06

Ollama & Open WebUI

Ollama makes running LLMs locally as simple as running Docker containers. Open WebUI provides a ChatGPT-like web interface on top. Together, they form the fastest path to running models on your own hardware.

Ollama

Ollama wraps llama.cpp in a user-friendly CLI with a model registry, automatic GGUF quantization handling, and a REST API. It manages model downloads, GPU detection, and memory allocation automatically.

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (downloads automatically on first use)
ollama run llama3.1
ollama run mistral
ollama run codellama:34b

# Pull a model without running it
ollama pull qwen2.5:72b

# List downloaded models
ollama list

# Show model details (parameters, quantization, size)
ollama show llama3.1

# REST API (default: localhost:11434)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1", "prompt": "What is PagedAttention?"}'

# Chat API (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Modelfile (custom models)

# Modelfile — like a Dockerfile for LLMs
FROM llama3.1

# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

# System prompt
SYSTEM """You are a senior DevOps engineer. Answer questions about
infrastructure, CI/CD, and cloud architecture. Be concise and
provide code examples when relevant."""

# Build and run
# ollama create devops-assistant -f Modelfile
# ollama run devops-assistant

Open WebUI

Open WebUI is a self-hosted ChatGPT-style interface that connects to Ollama (or any OpenAI-compatible API). Features include multi-model chat, RAG with document upload, conversation history, user management, and model management.

# Deploy Open WebUI with Docker (connects to Ollama on host)
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data --name open-webui \
  ghcr.io/open-webui/open-webui:main

# With GPU support and bundled Ollama
docker run -d -p 3000:8080 --gpus all \
  -v ollama:/root/.ollama -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:ollama

# Access at http://localhost:3000
# First user to sign up becomes admin

Feature Open WebUI Highlights

Multi-model conversations (switch models mid-chat)
RAG: upload PDFs, docs — model answers from your documents
Conversation branching and regeneration
User management with role-based access
Model management (pull/delete from UI)
Custom prompts library
Web search integration

Alt LM Studio

LM Studio is a desktop application (macOS, Windows, Linux) for running LLMs locally. GUI-based model discovery and download from HuggingFace. Built-in chat interface and OpenAI-compatible local server. Good for non-technical users or quick experimentation. Supports GGUF models with automatic GPU offloading.

07

Optimization & Quantization

Running large models on limited hardware requires aggressive optimization. Quantization is the most impactful technique — reducing numerical precision to shrink model size and increase throughput with minimal quality loss.

Quantization methods

Format GGUF (llama.cpp)

The standard format for CPU+GPU hybrid inference. Supports Q2 through Q8 quantization levels. Models can be partially offloaded to GPU. Most flexible for consumer hardware. Used by Ollama, LM Studio, and llama.cpp directly.

Format GPTQ

GPU-focused post-training quantization. Uses calibration data to minimize quantization error. 4-bit and 8-bit variants. Fast inference on NVIDIA GPUs via auto-gptq or exllama. Slightly better quality than naive round-to-nearest quantization.

Format AWQ (Activation-Aware)

Identifies the 1% of weights that matter most (based on activation magnitudes) and preserves them at higher precision. Better quality than GPTQ at the same bit width. Supported natively by vLLM. The recommended quantization for production GPU serving.

Format bitsandbytes

On-the-fly quantization during model loading. No pre-quantized model needed. load_in_4bit or load_in_8bit in HuggingFace Transformers. Convenient for experimentation but slower than pre-quantized formats. Enables QLoRA fine-tuning.

Precision comparison

Precision	Bits/Param	VRAM (7B)	Quality Impact	Use Case
FP32	32	~28 GB	Baseline (training)	Training only
FP16 / BF16	16	~14 GB	Negligible vs FP32	Default inference
INT8	8	~7 GB	Minimal (<1% quality loss)	Production serving
INT4 / Q4	4	~4 GB	Small (1–3% quality loss)	Consumer GPUs, edge
Q2 / Q3	2–3	~2–3 GB	Noticeable degradation	Extreme constraints only

Attention optimizations

Memory Flash Attention

Rewrites the attention computation to be IO-aware. Instead of materializing the full N×N attention matrix in GPU HBM, it computes attention in tiles that fit in SRAM. Reduces memory from O(n²) to O(n) and is 2–4× faster. Now standard in all major frameworks (Flash Attention 2/3).

Memory GQA & MQA

GQA (Grouped Query Attention): multiple query heads share a single KV head group, reducing KV cache size by 4–8×. Used by Llama 2/3/4, Mistral, Qwen. MQA (Multi-Query Attention): all query heads share one KV head. Even smaller cache but slightly lower quality. Used by Falcon, PaLM.

Advanced techniques

Speed Speculative Decoding

Use a small draft model (e.g., 1B) to predict 4–8 tokens ahead. The large model verifies all draft tokens in a single forward pass (parallel). If the draft is correct, you get multiple tokens for the cost of one large-model call. Typically 2–3× speedup for greedy decoding.

Size Pruning & Distillation

Pruning: remove weights that contribute little (structured or unstructured sparsity). Distillation: train a smaller "student" model to mimic a larger "teacher." Combined: distill a 70B model into a 7B model that retains 80–90% of the teacher's quality on your specific task.

Which quantization for your hardware?

16–24 GB Consumer GPU

RTX 3090/4090 (24GB) or RTX 5090 (32GB). Use Q4 GGUF via Ollama for up to 13B models. Use AWQ 4-bit via vLLM for maximum throughput. Can fit a Q4 34B model with careful memory management.

48–80 GB Single Data Center GPU

A100 40/80GB or H100 80GB. Run 7–13B models at FP16, or 70B at Q4/Q8. AWQ quantization via vLLM for production serving. FP8 on H100 for near-FP16 quality at half the memory.

Multi-GPU 2–8 GPUs

Run 70B+ models at FP16 with tensor parallelism. 2× A100 80GB = 160GB for a 70B FP16 model. 8× H100 = 640GB for a 405B model at Q4. Use vLLM or TensorRT-LLM for multi-GPU coordination.

CPU Only No GPU

Use llama.cpp / Ollama with Q4 GGUF models. Expect 2–10 tokens/sec for 7B models depending on CPU cores and RAM speed. Viable for low-traffic APIs and development. Needs 8–16GB RAM for 7B Q4.

08

Small Models & Edge

Not every problem needs GPT-4. Small models (under 4B parameters) can run on phones, laptops, Raspberry Pis, and embedded systems. For narrow, well-defined tasks, a fine-tuned small model often beats a general-purpose large model — at a fraction of the cost and latency.

Notable small models

3.8B Phi-4 Mini

Microsoft's small powerhouse. Runs on phones and laptops. Strong reasoning for its size, trained on high-quality synthetic data. 128K context window. Available in ONNX format for optimized mobile inference.

1B–4B Gemma 3

Google's edge-optimized models. Gemma 3 1B (text-only, 32K context) and 4B (multimodal, 128K context). Small enough for IoT devices with GPU acceleration. Good for classification, extraction, and simple generation tasks. Competitive with much larger models on narrow tasks after fine-tuning.

1.1B TinyLlama

Trained on 3T tokens (same as Llama 2 7B's training data volume). Surprisingly capable for 1.1B parameters. Good for embedded systems, rapid prototyping, and as a draft model for speculative decoding.

0.6B Qwen 3 0.6B

Ultra-lightweight, runs on almost anything. Good for simple classification, entity extraction, and template-based generation. Useful when you need a model that responds in single-digit milliseconds.

135M–1.7B SmolLM2 (HuggingFace)

HuggingFace's family of tiny models: 135M, 360M, and 1.7B. The 1.7B variant is trained on 11T tokens. The 135M model is small enough to run in a browser via WebAssembly. Good for research, education, and extremely constrained environments.

Edge use cases

On-device assistants — Siri/Google-style assistants that run locally, preserving privacy and working offline
Code completion in IDEs — sub-100ms completions using small code models (e.g., StarCoder2 3B, Codestral Mini)
Smart home / IoT — natural language control of devices without cloud dependency
Offline translation — small translation models for fieldwork, travel, or air-gapped environments
Classification & extraction — sentiment analysis, named entity recognition, intent detection — fine-tuned small models match or beat large models
Structured data extraction — parse invoices, receipts, medical records into JSON locally

When small beats large

Win Narrow Domain

A LoRA fine-tuned 3B model on your specific domain data (legal docs, medical records, your codebase) often outperforms GPT-4 on that domain — because it's seen thousands of your examples vs zero.

Win Speed & Cost

A 1B model generates at 100+ tokens/sec on a consumer GPU vs 30 tokens/sec for a 70B model. At scale, this means 100× lower compute cost per token. For high-throughput classification APIs, small models are the only viable option.

Win Privacy & Offline

When data cannot leave the device (healthcare, legal, government), small on-device models are the only option. No API calls, no data in transit, no third-party access. Full compliance with data sovereignty requirements.

Win Latency-Critical

Real-time applications (autocomplete, voice assistants, game NPCs) need sub-50ms first-token latency. Only small, locally-running models can achieve this consistently without network roundtrips.

Recommendation

Start with the smallest model that meets your quality threshold. Fine-tune with LoRA/QLoRA on your specific task data. Evaluate rigorously. Only move to larger models if the small model truly cannot meet your quality bar. Many teams jump to frontier APIs by default when a fine-tuned Phi-4 or Gemma 3 would have worked at 1/100th the cost.

09

Tools & Augmentation

LLMs alone are limited by their training data cutoff and inability to take actions. Tools and augmentation techniques connect models to real-time data, external systems, and multi-step workflows.

Function / tool calling

Models can output structured JSON to invoke functions you define. The model decides when to call a tool based on the user's request. Supported by OpenAI, Claude, Gemini, and most open models with instruction tuning.

# OpenAI function calling example
import openai

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }
}]

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Toronto?"}],
    tools=tools,
    tool_choice="auto"
)
# Model responds with: tool_calls[0].function.name = "get_weather"
#                       tool_calls[0].function.arguments = '{"city": "Toronto", "unit": "celsius"}'

RAG (Retrieval-Augmented Generation)

RAG grounds model responses in your own data. The pattern: embed documents into vectors, store in a vector database, retrieve relevant chunks at query time, and include them in the prompt.

User Query | [Embed Query] — text → vector (e.g., 1536 dimensions) | [Vector DB Search] — find top-K similar document chunks | [Build Prompt] = system prompt + retrieved chunks + user query | [LLM Generate] — answer grounded in retrieved context | Response (with citations)

DB Vector Databases

pgvector — PostgreSQL extension, simplest if you already run Postgres
Chroma — lightweight, embedded, good for prototyping
Pinecone — fully managed SaaS, scales to billions of vectors
Weaviate — open-source, hybrid search (vector + keyword)
Milvus — open-source, high performance, distributed

Tips RAG Best Practices

Chunk documents into 256–512 token segments with overlap
Use a strong embedding model (Cohere Embed v3, OpenAI text-embedding-3-large)
Retrieve 5–10 chunks, rerank with a cross-encoder before passing to LLM
Include metadata (source, page, date) so the model can cite sources
Test with both relevant and adversarial queries

MCP (Model Context Protocol)

The Model Context Protocol is Anthropic's open standard for connecting LLMs to external tools and data sources. Think of it as USB-C for AI — a standardized interface that any model can use to connect to any tool.

Architecture MCP Components

Host — the application (Claude Desktop, IDE, your app)
Client — maintains 1:1 connection with a server
Server — exposes tools, resources, and prompts

Transport: stdio (local processes) or Streamable HTTP (bidirectional via a single endpoint for remote servers; replaces the deprecated SSE transport).

vs MCP vs Function Calling

Function calling: stateless, per-request tool definitions
MCP: persistent connections, dynamic tool discovery
MCP servers can expose resources (read data) and prompts (reusable templates) in addition to tools
MCP enables a server ecosystem — install once, use across all MCP clients

// Example MCP server tool definition
{
  "name": "query_database",
  "description": "Execute a read-only SQL query against the production database",
  "inputSchema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "SQL SELECT query to execute"
      }
    },
    "required": ["query"]
  }
}

Agents

Agents are LLMs that plan and execute multi-step tasks using tools. The model operates in a ReAct loop: Reason about what to do, Act by calling a tool, Observe the result, and repeat until the task is complete.

# Simplified ReAct agent loop
while not task_complete:
    # 1. Reason: LLM decides what to do next
    response = llm.generate(
        system="You are an agent. Use tools to accomplish the task.",
        messages=conversation_history
    )

    # 2. Act: Execute tool calls from the response
    if response.tool_calls:
        for call in response.tool_calls:
            result = execute_tool(call.name, call.arguments)
            conversation_history.append(tool_result(call.id, result))

    # 3. Observe: LLM sees the result and decides next step
    else:
        task_complete = True  # LLM responded with text, not a tool call

Agent frameworks

Popular frameworks: LangGraph (LangChain's agent framework, graph-based workflows), CrewAI (multi-agent collaboration), AutoGen (Microsoft, multi-agent conversations), Claude Code (Anthropic's coding agent). For simple tool-calling patterns, you often don't need a framework — a while loop with the model's tool calling API is sufficient.

10

Infrastructure & GPUs

Choosing the right hardware for LLM inference is critical. The key constraint is GPU memory (VRAM) — the entire model (or its quantized version) plus the KV cache must fit in GPU memory for efficient inference.

GPU comparison

GPU	VRAM	Memory BW	FP16 TFLOPS	Price Range	Best For
RTX 3090	24 GB GDDR6X	936 GB/s	71	~$800 used	Budget inference, dev
RTX 4090	24 GB GDDR6X	1,008 GB/s	165	~$1,800	Best consumer GPU
RTX 5090	32 GB GDDR7	1,792 GB/s	209	~$2,000	Consumer, more VRAM
A100 (80GB)	80 GB HBM2e	2,039 GB/s	312	~$1.50/hr cloud	Production inference
H100 (SXM)	80 GB HBM3	3,350 GB/s	990	~$2.50/hr cloud	High-throughput serving
H200	141 GB HBM3e	4,800 GB/s	990	~$3.50/hr cloud	Large models, long context
B200	192 GB HBM3e	8,000 GB/s	2,250	~$5/hr cloud	Frontier model serving

Multi-GPU strategies

Strategy Tensor Parallelism

Split each layer's weight matrices across GPUs. Each GPU computes part of every layer, then they synchronize. Best for inference latency — all GPUs work on every token. Requires fast interconnect (NVLink). Use when model doesn't fit on one GPU.

Strategy Pipeline Parallelism

Assign different layers to different GPUs (GPU 0 = layers 0–19, GPU 1 = layers 20–39). Tokens flow through GPUs sequentially. Lower interconnect requirements but introduces pipeline bubbles. Better for training than inference.

Strategy Data Parallelism

Replicate the full model on each GPU, process different requests on each replica. Best for throughput when the model fits on a single GPU. Simple to set up — just run multiple vLLM instances behind a load balancer.

Strategy Expert Parallelism (MoE)

For Mixture of Experts models, distribute different experts across GPUs. The router sends each token to the GPU(s) hosting its assigned experts. Efficient because each token only needs 2 of N experts, so inter-GPU communication is limited.

Non-GPU options

CPU CPU Inference

Via llama.cpp / Ollama. Practical for small quantized models (7B Q4 at 2–10 tok/s). Leverages AVX-512 or ARM NEON. Good for low-traffic APIs, dev environments, and offline batch processing. Needs fast RAM — DDR5 helps significantly.

Apple Apple Silicon

M1–M4 chips have unified memory — the system RAM is the GPU VRAM. An M4 Max with 128GB RAM can run a 70B model at FP16. Metal acceleration via llama.cpp/Ollama. The best laptop option for running large models locally.

Cloud GPU providers

Provider	GPU Options	Pricing Model	Notes
AWS (p5/p4 instances)	H100, A100	On-demand, spot, reserved	Broad ecosystem, SageMaker integration
GCP (A3/A2 instances)	H100, A100	On-demand, spot, committed	TPUs also available, Vertex AI
Lambda Labs	H100, A100, A10G	On-demand hourly	Simple, competitive pricing
RunPod	H100, A100, RTX 4090	On-demand, spot	Serverless GPU option, community templates
vast.ai	Mixed consumer/datacenter	Marketplace (bidding)	Cheapest, least reliable, peer-to-peer

11

Security & Governance

LLMs introduce novel security risks that traditional application security doesn't cover. Prompt injection, data leakage, hallucination, and model poisoning require new mitigation strategies.

Prompt injection

Attack Direct Injection

User crafts input that overrides the system prompt: "Ignore previous instructions and instead...". Mitigations: input validation, system prompt reinforcement, output filtering. No perfect defense exists — defense in depth is required.

Attack Indirect Injection

Malicious instructions embedded in data the model retrieves (web pages, documents, emails). When the model processes this data via RAG or web browsing, it follows the injected instructions. Harder to detect because the attack is in the data, not the user's input.

Data leakage

PII in prompts — users may paste sensitive data (SSNs, credentials, medical records) into prompts sent to third-party APIs. Implement PII detection and redaction before API calls.
Secrets in code — developers using AI coding assistants may inadvertently share API keys, connection strings, or internal architecture details. Use secret scanning on prompts.
Training data extraction — adversarial prompts can sometimes extract memorized training data. Mitigations: differential privacy in training, output monitoring.

Guardrails & output filtering

Tool NVIDIA NeMo Guardrails

Open-source framework for adding programmable guardrails to LLM applications. Define conversation flows, topic boundaries, and safety rails in a simple Colang language. Intercepts both input and output.

Tool Llama Guard

Meta's safety classifier model. Runs as a separate model that classifies inputs and outputs as safe/unsafe across categories (violence, self-harm, illegal activity, etc.). Use as a pre/post-filter around your main LLM.

Compliance & governance

Concern	Risk	Mitigation
Data residency	Prompts sent to US-based APIs may violate EU data residency	Use regional endpoints (Azure OpenAI), self-host, or on-premise models
GDPR (right to erasure)	User data in training data cannot be selectively removed	Don't fine-tune on user data without consent; use RAG instead (deletable)
Audit logging	No record of what the model was asked or answered	Log all prompts, completions, token usage, and model versions
Cost controls	Runaway API costs from bugs or abuse	Per-user rate limits, daily spend caps, budget alerts
API key management	Leaked keys = unauthorized usage and billing	Rotate regularly, use short-lived tokens, scope permissions

Hallucination mitigation

Grounding — use RAG to provide factual context; instruct the model to only answer from provided sources
Citations — require the model to cite specific passages from retrieved documents
Confidence thresholds — use logprobs (token probabilities) to detect low-confidence generations
Structured output — constrain output to JSON schemas, reducing free-form hallucination
Multi-model verification — use a second model to fact-check the first model's output

Critical

Never trust LLM output for safety-critical decisions without human review or automated verification. LLMs can generate confident, well-formatted text that is factually wrong. Always validate against authoritative sources, especially for medical, legal, and financial applications.

12

Production Checklist

Choose the right model size — start with the smallest model that meets your quality threshold. Benchmark on your specific task before committing. A fine-tuned 8B model may outperform a generic 70B model on your domain.
Size your hardware — calculate VRAM needs: model weights + KV cache + overhead. Plan for peak concurrent requests, not average. Include 10–20% memory headroom for stability.
Pick a quantization strategy — use AWQ for GPU serving (vLLM), GGUF for CPU+GPU hybrid (Ollama). Benchmark quality on your eval set before and after quantization. Q4 is the sweet spot for most deployments.
Deploy with a production serving framework — use vLLM or TensorRT-LLM, not raw HuggingFace Transformers. Enable continuous batching and PagedAttention for throughput. Set max-model-len to limit KV cache growth.
Implement health checks and monitoring — monitor GPU utilization, VRAM usage, request latency (P50/P95/P99), throughput (tokens/sec), queue depth, and error rates. Alert on GPU memory > 90%, latency spikes, and OOM kills.
Set up input/output guardrails — filter PII from inputs before they reach the model. Validate outputs against schemas. Use safety classifiers (Llama Guard) for user-facing applications. Log everything.
Implement rate limiting and cost controls — per-user, per-API-key, and global rate limits. Set daily/monthly spend caps for API-based models. Monitor token consumption and alert on anomalies.
Plan for model updates — pin exact model versions (not "latest"). Test new model versions against your eval suite before deployment. Use blue-green or canary deployment patterns for model swaps.
Secure API keys and endpoints — use short-lived tokens or API gateways. Never expose model serving endpoints directly to the internet. mTLS between services. Rotate keys regularly.
Build an evaluation pipeline — automated tests with golden datasets for accuracy, hallucination rate, latency, and cost. Run evals on every model change, prompt change, and RAG pipeline change.
Configure autoscaling — scale GPU replicas based on request queue depth, not CPU utilization. Use data parallelism (multiple model replicas) for throughput scaling. Pre-warm replicas to avoid cold start latency.
Document your architecture — record model choice rationale, quantization method, hardware specs, prompt templates, RAG pipeline details, and fallback strategies. Future you will thank present you.