The AI Practitioner's Field Guide

01

Foundation

AI Fundamentals & Model Architecture

Modern large language models (LLMs) are transformer-based neural networks trained on massive text corpora. Understanding their core mechanics helps you use them effectively and avoid common pitfalls.

Core Concept

🧠

The Context Window

Every model processes text within a fixed "window" of tokens — roughly 4 characters each. Tokens within this window are all equally accessible to the model. Content outside it simply doesn't exist to the model during that call.

Core Concept

🔢

Tokenization

Text is split into tokens before processing. Common words are single tokens; rare or technical words split into multiple. "unbelievable" might be 3 tokens. This affects cost, context limits, and how the model processes text.

Core Concept

🌡️

Temperature & Sampling

Temperature (0–1+) controls randomness. At 0, the model picks the most probable next token deterministically. Higher values introduce creative variation. Use low temps for factual tasks, higher for creative generation.

Architecture

🔄

Autoregressive Generation

Models generate one token at a time, each conditioned on all previous tokens. This means early choices constrain later ones — a key reason why structured output formats and clear early framing help so much.

Architecture

📐

System vs. User Prompts

System prompts define the model's persona, capabilities, and constraints at setup time. User prompts are the runtime requests. Most models give higher authority to system-level instructions.

Key Insight

⚡

Emergent Capabilities

Capabilities like reasoning, in-context learning, and tool use aren't explicitly programmed — they emerge from scale. This is both the magic and the mystery of modern LLMs.

Model Type	Best For	Limitations	Typical Use Case
frontier/large	Complex reasoning, nuanced writing, multi-step tasks	Slower, more expensive per token	Legal review, architecture design, research synthesis
mid-size	Good balance of speed, cost, capability	May struggle with very complex chains	Customer support, coding assistance, summarization
small/fast	High throughput, low latency, classification	Limited reasoning depth	Intent routing, triage, simple Q&A
fine-tuned	Domain-specific tasks with consistent format	Requires data/training; may lose general capability	Medical coding, financial extraction, brand voice
embedding	Semantic search, clustering, similarity	Not generative — produces vectors, not text	RAG retrieval, duplicate detection, recommendation

02

Core Skill

Prompt Engineering

Prompting is how you communicate intent to an AI model. It's part art, part science — and the highest-leverage skill for getting reliable, quality outputs.

Anatomy of a Great Prompt

Prompt Template
# SYSTEM PROMPT
You are an expert data analyst at a logistics company.
Your tone is concise, precise, and uses metric units.
Always structure responses as: Summary → Data → Recommendation.
Never speculate beyond the provided data.

# USER PROMPT
Context: Q3 shipment data shows 12% delay increase in Zone B.

Task: Identify the top 3 likely causes and rank by impact.

Format: Numbered list. Max 3 sentences per item.

Output example:
1. [Cause] — [Evidence] — [Recommended action]
2. ...

Core Techniques

Zero-shot Prompting
Simply ask. Works when task is clear and model has strong priors.
Few-shot Examples
Include 2–5 input/output examples. Dramatically improves format consistency.
Chain-of-Thought
Add "Think step by step" or show reasoning in examples. Unlocks multi-step logic.
Role Assignment
Define who the model is. Primes tone, vocabulary, and domain knowledge activation.
Output Constraints
Specify format, length, structure. Never assume the model knows your preferred format.
Negative Instructions
Explicitly state what NOT to do. "Do not include caveats" outperforms hoping it won't.

Place the most important instruction first and last in your prompt. Models show recency and primacy bias — key constraints buried in the middle are often ignored.

Vague prompts like "write something good about our product" produce mediocre outputs. Specificity is the single biggest lever: audience, format, tone, length, goal, constraints.

Temperature doesn't replace prompt quality. A well-designed prompt at temp 0 usually outperforms a vague prompt at any temperature.

Prompt Patterns Reference

Chain-of-Thought Example
// Prompting the model to reason before answering
const prompt = `
Question: If a train leaves at 9:00 AM traveling 80 mph and another
leaves at 10:30 AM traveling 100 mph from 400 miles away, when do they meet?

Instructions: Work through this step by step, showing your calculations.
Then give a final single-sentence answer.
`;

// Output will contain visible reasoning chain + final answer
// This "scratchpad" approach dramatically improves accuracy

03

Advanced Architecture

Agentic AI & Multi-Step Systems

Agentic AI refers to systems where an LLM autonomously plans, uses tools, and executes multi-step workflows — rather than responding to a single prompt and stopping.

An "agent" is just an LLM in a loop: it perceives state, decides on an action, executes it, observes the result, and repeats — until the task is complete or it determines it can't proceed. — Architectural principle

The Agent Loop

🎯Goal / Task

→

🤔Plan & Reason

→

🔧Select Tool

→

⚡Execute

→

👁️Observe Result

→

🔁Loop or Done

Pattern

🤖

ReAct (Reason + Act)

The model alternates between "Thought" (reasoning about what to do next), "Action" (invoking a tool), and "Observation" (processing the result). Traces are fully visible and debuggable.

ReliableTransparent

Pattern

🗓️

Plan-and-Execute

A "planner" model generates a full task breakdown first. An "executor" then works through each step with access to tools. Separates high-level strategy from low-level execution.

StructuredComposable

Pattern

🌐

Multi-Agent Systems

Multiple specialized agents coordinate: an orchestrator delegates to sub-agents (researcher, writer, coder). Each has narrow scope and relevant tools. Scales complexity without overloading a single context.

ScalableSpecialized

Pattern

🔍

Reflection & Self-Critique

After generating output, the model reviews its own work against criteria, then revises. A separate "critic" model can also evaluate outputs. Substantially improves quality on complex tasks.

QualityIterative

Pattern

🌲

Tree of Thought

Explores multiple reasoning paths in parallel (branching like a tree), evaluates each, and pursues the most promising branch. Excellent for problems with many plausible solution paths.

Complex Tasks

Pattern

🔃

Human-in-the-Loop

Checkpoints where the agent pauses and requests human approval before taking consequential actions. Essential for any task with real-world effects (sending emails, writing code to deploy, etc.).

SafeRecommended

Start with minimal autonomy. Build agents that run 1–2 steps before requiring human confirmation. Only expand autonomy after you understand failure modes. The biggest agentic failures come from over-trusting a model's plan early in development.

Simple ReAct Agent Loop (Python Pseudocode)
def run_agent(task: str, tools: dict, max_steps: int = 10):
    messages = [
        {"role": "system", "content": AGENT_SYSTEM_PROMPT},
        {"role": "user",   "content": task}
    ]

    for step in range(max_steps):
        response = llm.call(messages, tool_definitions=tools)

        # Agent chose to use a tool
        if response.tool_call:
            tool_name = response.tool_call.name
            tool_args = response.tool_call.args
            tool_result = tools[tool_name](**tool_args)  # Execute tool

            messages.append(response.as_message())
            messages.append({
                "role": "tool",
                "content": str(tool_result)
            })

        # Agent is done — return final answer
        elif response.final_answer:
            return response.final_answer

    raise AgentTimeoutError("Max steps reached without resolution")

04

State Management

Memory Systems & State

LLMs have no built-in persistent memory — they start fresh every conversation. All "memory" must be engineered. Understanding the four memory types is foundational for building coherent AI systems.

Type 1 — In-Context

Working Memory

Everything within the current context window. Fast, zero-latency. Disappears when the session ends. Limited by token budget. This is all the model natively has.

Type 2 — External Store

Episodic Memory

Past interactions stored externally (DB, vector store). Retrieved and injected into context when relevant. Enables persistent user history and long-term continuity.

Type 3 — Semantic Store

Knowledge Base

Structured or embedded knowledge (documents, facts, code). Retrieved via semantic search (RAG). Gives the model access to information beyond its training cutoff.

Type 4 — Procedural

Skill / Fine-tuned Weights

Capabilities baked into model weights via fine-tuning or RLHF. These are "how to do X" patterns the model learned during training. Not easily updated post-deployment.

Context Window Strategies

Summarization Compression

Periodically summarize older conversation turns into a compact representation. Replace verbose history with the summary to free up token budget.
Sliding Window

Keep only the last N turns in context. Cheap to implement. Works well when recent context is most relevant (e.g., coding sessions, live support chats).
Hierarchical Context

Maintain a multi-level structure: full recent turns + summaries of older turns + a persistent "facts extracted" block for key entities and decisions.
Memory Extraction

After each turn, prompt the model: "Extract any new facts worth remembering." Store structured key-value pairs externally and re-inject on next session start.

Token budget management: Reserve at least 20% of your context window for the model's response. If your prompt uses 90% of the window, outputs will be truncated or degraded.

Memory Injection Pattern
def build_prompt_with_memory(user_msg, user_id):
    # 1. Retrieve relevant memories
    memories = memory_store.search(
        query=user_msg,
        user_id=user_id,
        top_k=5
    )

    # 2. Retrieve recent history
    history = history_store.get_recent(
        user_id=user_id,
        turns=10
    )

    # 3. Inject into system prompt
    system = f"""
User context:
{format_memories(memories)}

Recent conversation:
{format_history(history)}
    """
    return system, user_msg

05

Function Calling

Tool Use & Function Calling

Tool use (aka function calling) allows models to invoke external capabilities — APIs, databases, code executors — and incorporate results into their responses. This is what transforms a chatbot into an agent.

Tool Definition Schema (Anthropic API format)
tools = [
  {
    "name": "search_knowledge_base",
    "description": "Search internal documentation. Use when the user asks
      about company policies, product specs, or procedures.
      Returns top 3 relevant passages with source IDs.",
    "input_schema": {
      "type": "object",
      "properties": {
        "query": {
          "type": "string",
          "description": "Natural language search query"
        },
        "category": {
          "type": "string",
          "enum": ["policies", "products", "procedures"],
          "description": "Filter results by category"
        }
      },
      "required": ["query"]
    }
  }
]

# Key: descriptions are for the MODEL, not developers.
# Be explicit about WHEN to use this tool vs. others.

The description field of each tool is your most powerful lever. Describe when to use the tool, what it returns, and any edge cases. A well-described tool with average implementation outperforms a brilliant tool with a vague description.

Tool Category

🔍

Information Retrieval

Web search, vector DB search, SQL queries, API reads. Give models access to current, domain-specific, or user-specific data. The most common and safest category.

Tool Category

💻

Code Execution

Sandboxed Python/JS interpreters. Lets models verify math, process data, generate charts. Always sandbox — never expose direct system access to a model.

Tool Category

✍️

Write / Mutate Actions

Creating files, sending messages, updating databases, triggering webhooks. Highest risk category — add confirmation steps and human-in-the-loop checkpoints before any irreversible action.

Tool Category

🤝

Agent Handoff

Calling specialized sub-agents or routing to different model configurations. Enables modular multi-agent architectures where each agent handles its domain.

Tool Category

🖥️

Computer Use

Taking screenshots, clicking UI elements, filling forms. Frontier capability — models can interact with arbitrary software as a human would. Requires careful sandboxing and oversight.

Tool Category

📊

Structured Output

Forcing the model to output valid JSON, XML, or typed schemas. Use tools with strict output schemas rather than asking the model to "format as JSON" in free text.

06

Knowledge Grounding

Retrieval-Augmented Generation (RAG)

RAG combines retrieval (finding relevant documents) with generation (producing responses). It grounds model outputs in a specific knowledge base, dramatically reducing hallucination on factual tasks.

📄Source Docs

→

✂️Chunk & Clean

→

🔢Embed

→

🗄️Vector Store

→

🔍Retrieve Top-K

→

💬Generate

Chunking Strategies

Strategy	When to Use
Fixed-size (512 tokens)	Baseline; works for most homogeneous docs
Semantic (paragraph/section)	Docs with natural section breaks
Sliding window w/ overlap	When key info may span chunk boundaries
Hierarchical (doc → section → chunk)	Complex, nested documents
Proposition-based	High-precision Q&A requiring single-fact precision

Retrieval Methods

Method	Strength
Dense (vector cosine)	Semantic similarity, paraphrase-robust
Sparse (BM25/TF-IDF)	Exact keyword match, proper nouns
Hybrid (dense + sparse)	Best of both — recommended default
HyDE	Generate hypothetical answer first; retrieve using it
Re-ranking (cross-encoder)	Re-score top-N for precision; adds latency

Metadata filtering first. Before semantic search, filter by document type, date range, department, or other metadata. Retrieval over 50 filtered documents beats retrieval over 50,000 unfiltered ones — both in speed and relevance.

Don't skip evaluation. RAG systems fail silently. Set up a golden Q&A test set before launch. Measure both retrieval recall (did we fetch the right docs?) and answer faithfulness (did the model hallucinate beyond the retrieved content?) separately.

07

Trust & Control

Safety, Alignment & Guardrails

As AI systems gain agency, safety engineering becomes critical. The goal is ensuring models pursue intended goals and stop when they shouldn't proceed — without requiring constant human supervision.

Safety Layer

🛡️

Input Guardrails

Classify and filter incoming requests before they reach your main model. Use fast, cheap classifiers to detect policy violations, prompt injections, or off-topic requests. Fail loudly and early.

Safety Layer

🔬

Output Guardrails

Validate model outputs before returning to users or passing to downstream systems. Check for hallucinations, policy violations, PII leakage, or malformed structured outputs. Never trust raw model output blindly.

Safety Layer

🎯

Prompt Injection Defense

Malicious content in retrieved documents or user inputs can attempt to override system instructions. Use clear delimiters, privilege separation, and never interpolate untrusted content directly into system prompts.

Safety Layer

📋

Minimal Privilege

Only grant the model access to tools and data it needs for the current task. An agent answering customer questions doesn't need database write access. Scope tool permissions tightly.

Safety Layer

✋

Reversibility Preference

Design agents to prefer reversible over irreversible actions. "Draft email" before "send email." "Stage database change" before "commit." Irreversible actions should always require explicit human confirmation.

Safety Layer

📊

Observability & Logging

Log every agent step, tool call, and decision with full context. You cannot debug what you cannot observe. Structured traces are essential for diagnosing failures in complex agentic systems.

LLM outputs are not trusted by default. Treat model outputs like you would user-generated content: sanitize before rendering in HTML, validate before writing to databases, review before sending to external parties.

08

Production Engineering

Production Patterns & Best Practices

Moving from prototype to production requires reliability patterns, cost management, and evaluation frameworks. These are the practices that separate demos from dependable systems.

Reliability

Retry with Backoff
API calls fail. Always implement exponential backoff with jitter. 3 retries with 1s/2s/4s waits handles 99% of transient errors.
Fallback Models
If primary model is unavailable, fall back to secondary. Design model-agnostic prompt structures to make this easier.
Timeout Budgets
Set explicit timeouts per agent step. An agent stuck in a loop should self-terminate, not run until external kill.
Idempotency
Design tool calls to be safe to retry. Track tool call IDs to prevent duplicate side effects on retry.

Cost Management

Model Routing
Route simple tasks to fast/cheap models. Use an intent classifier to select which model tier handles each request.
Prompt Caching
Cache system prompts and large static context blocks. Re-use across calls with same prefix. Can cut costs 80%+ on repeated structures.
Response Caching
Cache identical or semantically similar queries. Use semantic cache keys for fuzzy matching.
Batch Processing
Use async batch APIs for non-latency-sensitive workloads. Typically 50% cheaper than synchronous calls.

Evaluation Framework

Eval Type

Unit Tests (Assertions)

Deterministic checks on model output: does it contain required fields? Is JSON valid? Does it avoid banned phrases? Fast, cheap, run on every change.

AutomatedAlways-on

Eval Type

LLM-as-Judge

A separate (typically larger) model evaluates quality dimensions: accuracy, helpfulness, safety, tone. Scales to thousands of samples without human review. Use structured scoring rubrics.

Scalable

Eval Type

Human Evaluation

Gold standard for subjective quality. Use for calibration of automated evals and for high-stakes decisions. Prefer pairwise comparisons (A vs. B) over absolute scoring — more reliable.

Ground Truth

Eval Type

A/B Testing in Prod

Route a percentage of real traffic to a new model/prompt version. Measure business metrics (task completion, escalation rate, satisfaction) rather than just model metrics.

Real Signal

Start evaluations on day one. Build a small golden test set (50–100 examples) before you build anything else. Every prompt change should run against it. Evals are the unit tests of AI systems — skipping them means shipping blind.

Quick Reference Checklist

Before Going Live

Prompt version controlled in git
Golden eval set created and passing
Input + output guardrails in place
All PII fields identified and masked
API key scoped to minimum permissions
Retry + timeout logic implemented
Full request/response logging enabled
Rate limit handling tested
Costs estimated and budget alerts set
Human escalation path defined

For Agentic Systems

Max steps / iteration limit set
Every tool call logged with args + result
Irreversible actions require confirmation
Minimal privilege enforced on all tools
Prompt injection vectors reviewed
Graceful degradation path exists
State corruption recovery planned
Human-in-the-loop checkpoint defined
Failure modes tested with adversarial inputs
Rollback procedure documented

The AI FieldGuide

AI Fundamentals & Model Architecture

The Context Window

Tokenization

Temperature & Sampling

Autoregressive Generation

System vs. User Prompts

Emergent Capabilities

Prompt Engineering

Core Techniques

Agentic AI & Multi-Step Systems

ReAct (Reason + Act)

Plan-and-Execute

Multi-Agent Systems

Reflection & Self-Critique

Tree of Thought

Human-in-the-Loop

Memory Systems & State

Context Window Strategies

Summarization Compression

Sliding Window

Hierarchical Context

Memory Extraction

Tool Use & Function Calling

Information Retrieval

Code Execution

Write / Mutate Actions

Agent Handoff

Computer Use

Structured Output

Retrieval-Augmented Generation (RAG)

Chunking Strategies

Retrieval Methods

Safety, Alignment & Guardrails

Input Guardrails

Output Guardrails

Prompt Injection Defense

Minimal Privilege

Reversibility Preference

Observability & Logging

Production Patterns & Best Practices

Reliability

Cost Management

Unit Tests (Assertions)

LLM-as-Judge

Human Evaluation

A/B Testing in Prod

The AI Field
Guide