The AI Field
Guide
A comprehensive reference for understanding, building with, and deploying modern AI systems — from prompt fundamentals to autonomous agentic pipelines.
AI Fundamentals & Model Architecture
Modern large language models (LLMs) are transformer-based neural networks trained on massive text corpora. Understanding their core mechanics helps you use them effectively and avoid common pitfalls.
The Context Window
Every model processes text within a fixed "window" of tokens — roughly 4 characters each. Tokens within this window are all equally accessible to the model. Content outside it simply doesn't exist to the model during that call.
Tokenization
Text is split into tokens before processing. Common words are single tokens; rare or technical words split into multiple. "unbelievable" might be 3 tokens. This affects cost, context limits, and how the model processes text.
Temperature & Sampling
Temperature (0–1+) controls randomness. At 0, the model picks the most probable next token deterministically. Higher values introduce creative variation. Use low temps for factual tasks, higher for creative generation.
Autoregressive Generation
Models generate one token at a time, each conditioned on all previous tokens. This means early choices constrain later ones — a key reason why structured output formats and clear early framing help so much.
System vs. User Prompts
System prompts define the model's persona, capabilities, and constraints at setup time. User prompts are the runtime requests. Most models give higher authority to system-level instructions.
Emergent Capabilities
Capabilities like reasoning, in-context learning, and tool use aren't explicitly programmed — they emerge from scale. This is both the magic and the mystery of modern LLMs.
| Model Type | Best For | Limitations | Typical Use Case |
|---|---|---|---|
| frontier/large | Complex reasoning, nuanced writing, multi-step tasks | Slower, more expensive per token | Legal review, architecture design, research synthesis |
| mid-size | Good balance of speed, cost, capability | May struggle with very complex chains | Customer support, coding assistance, summarization |
| small/fast | High throughput, low latency, classification | Limited reasoning depth | Intent routing, triage, simple Q&A |
| fine-tuned | Domain-specific tasks with consistent format | Requires data/training; may lose general capability | Medical coding, financial extraction, brand voice |
| embedding | Semantic search, clustering, similarity | Not generative — produces vectors, not text | RAG retrieval, duplicate detection, recommendation |
Prompt Engineering
Prompting is how you communicate intent to an AI model. It's part art, part science — and the highest-leverage skill for getting reliable, quality outputs.
# SYSTEM PROMPT You are an expert data analyst at a logistics company. Your tone is concise, precise, and uses metric units. Always structure responses as: Summary → Data → Recommendation. Never speculate beyond the provided data. # USER PROMPT Context: Q3 shipment data shows 12% delay increase in Zone B. Task: Identify the top 3 likely causes and rank by impact. Format: Numbered list. Max 3 sentences per item. Output example: 1. [Cause] — [Evidence] — [Recommended action] 2. ...
Core Techniques
- Zero-shot Prompting
Simply ask. Works when task is clear and model has strong priors. - Few-shot Examples
Include 2–5 input/output examples. Dramatically improves format consistency. - Chain-of-Thought
Add "Think step by step" or show reasoning in examples. Unlocks multi-step logic. - Role Assignment
Define who the model is. Primes tone, vocabulary, and domain knowledge activation. - Output Constraints
Specify format, length, structure. Never assume the model knows your preferred format. - Negative Instructions
Explicitly state what NOT to do. "Do not include caveats" outperforms hoping it won't.
Place the most important instruction first and last in your prompt. Models show recency and primacy bias — key constraints buried in the middle are often ignored.
Vague prompts like "write something good about our product" produce mediocre outputs. Specificity is the single biggest lever: audience, format, tone, length, goal, constraints.
Temperature doesn't replace prompt quality. A well-designed prompt at temp 0 usually outperforms a vague prompt at any temperature.
// Prompting the model to reason before answering const prompt = ` Question: If a train leaves at 9:00 AM traveling 80 mph and another leaves at 10:30 AM traveling 100 mph from 400 miles away, when do they meet? Instructions: Work through this step by step, showing your calculations. Then give a final single-sentence answer. `; // Output will contain visible reasoning chain + final answer // This "scratchpad" approach dramatically improves accuracy
Agentic AI & Multi-Step Systems
Agentic AI refers to systems where an LLM autonomously plans, uses tools, and executes multi-step workflows — rather than responding to a single prompt and stopping.
An "agent" is just an LLM in a loop: it perceives state, decides on an action, executes it, observes the result, and repeats — until the task is complete or it determines it can't proceed. — Architectural principle
ReAct (Reason + Act)
The model alternates between "Thought" (reasoning about what to do next), "Action" (invoking a tool), and "Observation" (processing the result). Traces are fully visible and debuggable.
Plan-and-Execute
A "planner" model generates a full task breakdown first. An "executor" then works through each step with access to tools. Separates high-level strategy from low-level execution.
Multi-Agent Systems
Multiple specialized agents coordinate: an orchestrator delegates to sub-agents (researcher, writer, coder). Each has narrow scope and relevant tools. Scales complexity without overloading a single context.
Reflection & Self-Critique
After generating output, the model reviews its own work against criteria, then revises. A separate "critic" model can also evaluate outputs. Substantially improves quality on complex tasks.
Tree of Thought
Explores multiple reasoning paths in parallel (branching like a tree), evaluates each, and pursues the most promising branch. Excellent for problems with many plausible solution paths.
Human-in-the-Loop
Checkpoints where the agent pauses and requests human approval before taking consequential actions. Essential for any task with real-world effects (sending emails, writing code to deploy, etc.).
Start with minimal autonomy. Build agents that run 1–2 steps before requiring human confirmation. Only expand autonomy after you understand failure modes. The biggest agentic failures come from over-trusting a model's plan early in development.
def run_agent(task: str, tools: dict, max_steps: int = 10): messages = [ {"role": "system", "content": AGENT_SYSTEM_PROMPT}, {"role": "user", "content": task} ] for step in range(max_steps): response = llm.call(messages, tool_definitions=tools) # Agent chose to use a tool if response.tool_call: tool_name = response.tool_call.name tool_args = response.tool_call.args tool_result = tools[tool_name](**tool_args) # Execute tool messages.append(response.as_message()) messages.append({ "role": "tool", "content": str(tool_result) }) # Agent is done — return final answer elif response.final_answer: return response.final_answer raise AgentTimeoutError("Max steps reached without resolution")
Memory Systems & State
LLMs have no built-in persistent memory — they start fresh every conversation. All "memory" must be engineered. Understanding the four memory types is foundational for building coherent AI systems.
Everything within the current context window. Fast, zero-latency. Disappears when the session ends. Limited by token budget. This is all the model natively has.
Past interactions stored externally (DB, vector store). Retrieved and injected into context when relevant. Enables persistent user history and long-term continuity.
Structured or embedded knowledge (documents, facts, code). Retrieved via semantic search (RAG). Gives the model access to information beyond its training cutoff.
Capabilities baked into model weights via fine-tuning or RLHF. These are "how to do X" patterns the model learned during training. Not easily updated post-deployment.
Context Window Strategies
-
Summarization Compression
Periodically summarize older conversation turns into a compact representation. Replace verbose history with the summary to free up token budget.
-
Sliding Window
Keep only the last N turns in context. Cheap to implement. Works well when recent context is most relevant (e.g., coding sessions, live support chats).
-
Hierarchical Context
Maintain a multi-level structure: full recent turns + summaries of older turns + a persistent "facts extracted" block for key entities and decisions.
-
Memory Extraction
After each turn, prompt the model: "Extract any new facts worth remembering." Store structured key-value pairs externally and re-inject on next session start.
Token budget management: Reserve at least 20% of your context window for the model's response. If your prompt uses 90% of the window, outputs will be truncated or degraded.
def build_prompt_with_memory(user_msg, user_id): # 1. Retrieve relevant memories memories = memory_store.search( query=user_msg, user_id=user_id, top_k=5 ) # 2. Retrieve recent history history = history_store.get_recent( user_id=user_id, turns=10 ) # 3. Inject into system prompt system = f""" User context: {format_memories(memories)} Recent conversation: {format_history(history)} """ return system, user_msg
Tool Use & Function Calling
Tool use (aka function calling) allows models to invoke external capabilities — APIs, databases, code executors — and incorporate results into their responses. This is what transforms a chatbot into an agent.
tools = [ { "name": "search_knowledge_base", "description": "Search internal documentation. Use when the user asks about company policies, product specs, or procedures. Returns top 3 relevant passages with source IDs.", "input_schema": { "type": "object", "properties": { "query": { "type": "string", "description": "Natural language search query" }, "category": { "type": "string", "enum": ["policies", "products", "procedures"], "description": "Filter results by category" } }, "required": ["query"] } } ] # Key: descriptions are for the MODEL, not developers. # Be explicit about WHEN to use this tool vs. others.
The description field of each tool is your most powerful lever. Describe when to use the tool, what it returns, and any edge cases. A well-described tool with average implementation outperforms a brilliant tool with a vague description.
Information Retrieval
Web search, vector DB search, SQL queries, API reads. Give models access to current, domain-specific, or user-specific data. The most common and safest category.
Code Execution
Sandboxed Python/JS interpreters. Lets models verify math, process data, generate charts. Always sandbox — never expose direct system access to a model.
Write / Mutate Actions
Creating files, sending messages, updating databases, triggering webhooks. Highest risk category — add confirmation steps and human-in-the-loop checkpoints before any irreversible action.
Agent Handoff
Calling specialized sub-agents or routing to different model configurations. Enables modular multi-agent architectures where each agent handles its domain.
Computer Use
Taking screenshots, clicking UI elements, filling forms. Frontier capability — models can interact with arbitrary software as a human would. Requires careful sandboxing and oversight.
Structured Output
Forcing the model to output valid JSON, XML, or typed schemas. Use tools with strict output schemas rather than asking the model to "format as JSON" in free text.
Retrieval-Augmented Generation (RAG)
RAG combines retrieval (finding relevant documents) with generation (producing responses). It grounds model outputs in a specific knowledge base, dramatically reducing hallucination on factual tasks.
Chunking Strategies
| Strategy | When to Use |
|---|---|
| Fixed-size (512 tokens) | Baseline; works for most homogeneous docs |
| Semantic (paragraph/section) | Docs with natural section breaks |
| Sliding window w/ overlap | When key info may span chunk boundaries |
| Hierarchical (doc → section → chunk) | Complex, nested documents |
| Proposition-based | High-precision Q&A requiring single-fact precision |
Retrieval Methods
| Method | Strength |
|---|---|
| Dense (vector cosine) | Semantic similarity, paraphrase-robust |
| Sparse (BM25/TF-IDF) | Exact keyword match, proper nouns |
| Hybrid (dense + sparse) | Best of both — recommended default |
| HyDE | Generate hypothetical answer first; retrieve using it |
| Re-ranking (cross-encoder) | Re-score top-N for precision; adds latency |
Metadata filtering first. Before semantic search, filter by document type, date range, department, or other metadata. Retrieval over 50 filtered documents beats retrieval over 50,000 unfiltered ones — both in speed and relevance.
Don't skip evaluation. RAG systems fail silently. Set up a golden Q&A test set before launch. Measure both retrieval recall (did we fetch the right docs?) and answer faithfulness (did the model hallucinate beyond the retrieved content?) separately.
Safety, Alignment & Guardrails
As AI systems gain agency, safety engineering becomes critical. The goal is ensuring models pursue intended goals and stop when they shouldn't proceed — without requiring constant human supervision.
Input Guardrails
Classify and filter incoming requests before they reach your main model. Use fast, cheap classifiers to detect policy violations, prompt injections, or off-topic requests. Fail loudly and early.
Output Guardrails
Validate model outputs before returning to users or passing to downstream systems. Check for hallucinations, policy violations, PII leakage, or malformed structured outputs. Never trust raw model output blindly.
Prompt Injection Defense
Malicious content in retrieved documents or user inputs can attempt to override system instructions. Use clear delimiters, privilege separation, and never interpolate untrusted content directly into system prompts.
Minimal Privilege
Only grant the model access to tools and data it needs for the current task. An agent answering customer questions doesn't need database write access. Scope tool permissions tightly.
Reversibility Preference
Design agents to prefer reversible over irreversible actions. "Draft email" before "send email." "Stage database change" before "commit." Irreversible actions should always require explicit human confirmation.
Observability & Logging
Log every agent step, tool call, and decision with full context. You cannot debug what you cannot observe. Structured traces are essential for diagnosing failures in complex agentic systems.
LLM outputs are not trusted by default. Treat model outputs like you would user-generated content: sanitize before rendering in HTML, validate before writing to databases, review before sending to external parties.
Production Patterns & Best Practices
Moving from prototype to production requires reliability patterns, cost management, and evaluation frameworks. These are the practices that separate demos from dependable systems.
Reliability
- Retry with Backoff
API calls fail. Always implement exponential backoff with jitter. 3 retries with 1s/2s/4s waits handles 99% of transient errors. - Fallback Models
If primary model is unavailable, fall back to secondary. Design model-agnostic prompt structures to make this easier. - Timeout Budgets
Set explicit timeouts per agent step. An agent stuck in a loop should self-terminate, not run until external kill. - Idempotency
Design tool calls to be safe to retry. Track tool call IDs to prevent duplicate side effects on retry.
Cost Management
- Model Routing
Route simple tasks to fast/cheap models. Use an intent classifier to select which model tier handles each request. - Prompt Caching
Cache system prompts and large static context blocks. Re-use across calls with same prefix. Can cut costs 80%+ on repeated structures. - Response Caching
Cache identical or semantically similar queries. Use semantic cache keys for fuzzy matching. - Batch Processing
Use async batch APIs for non-latency-sensitive workloads. Typically 50% cheaper than synchronous calls.
Unit Tests (Assertions)
Deterministic checks on model output: does it contain required fields? Is JSON valid? Does it avoid banned phrases? Fast, cheap, run on every change.
LLM-as-Judge
A separate (typically larger) model evaluates quality dimensions: accuracy, helpfulness, safety, tone. Scales to thousands of samples without human review. Use structured scoring rubrics.
Human Evaluation
Gold standard for subjective quality. Use for calibration of automated evals and for high-stakes decisions. Prefer pairwise comparisons (A vs. B) over absolute scoring — more reliable.
A/B Testing in Prod
Route a percentage of real traffic to a new model/prompt version. Measure business metrics (task completion, escalation rate, satisfaction) rather than just model metrics.
Start evaluations on day one. Build a small golden test set (50–100 examples) before you build anything else. Every prompt change should run against it. Evals are the unit tests of AI systems — skipping them means shipping blind.
Before Going Live
- Prompt version controlled in git
- Golden eval set created and passing
- Input + output guardrails in place
- All PII fields identified and masked
- API key scoped to minimum permissions
- Retry + timeout logic implemented
- Full request/response logging enabled
- Rate limit handling tested
- Costs estimated and budget alerts set
- Human escalation path defined
For Agentic Systems
- Max steps / iteration limit set
- Every tool call logged with args + result
- Irreversible actions require confirmation
- Minimal privilege enforced on all tools
- Prompt injection vectors reviewed
- Graceful degradation path exists
- State corruption recovery planned
- Human-in-the-loop checkpoint defined
- Failure modes tested with adversarial inputs
- Rollback procedure documented