Tech · AI & MLOps

The AI Field
Guide

A comprehensive reference for understanding, building with, and deploying modern AI systems — from prompt fundamentals to autonomous agentic pipelines.

8
Core Topics
50+
Best Practices
Rabbit Holes
01

AI Fundamentals & Model Architecture

Modern large language models (LLMs) are transformer-based neural networks trained on massive text corpora. Understanding their core mechanics helps you use them effectively and avoid common pitfalls.

Core Concept
🧠

The Context Window

Every model processes text within a fixed "window" of tokens — roughly 4 characters each. Tokens within this window are all equally accessible to the model. Content outside it simply doesn't exist to the model during that call.

Core Concept
🔢

Tokenization

Text is split into tokens before processing. Common words are single tokens; rare or technical words split into multiple. "unbelievable" might be 3 tokens. This affects cost, context limits, and how the model processes text.

Core Concept
🌡️

Temperature & Sampling

Temperature (0–1+) controls randomness. At 0, the model picks the most probable next token deterministically. Higher values introduce creative variation. Use low temps for factual tasks, higher for creative generation.

Architecture
🔄

Autoregressive Generation

Models generate one token at a time, each conditioned on all previous tokens. This means early choices constrain later ones — a key reason why structured output formats and clear early framing help so much.

Architecture
📐

System vs. User Prompts

System prompts define the model's persona, capabilities, and constraints at setup time. User prompts are the runtime requests. Most models give higher authority to system-level instructions.

Key Insight

Emergent Capabilities

Capabilities like reasoning, in-context learning, and tool use aren't explicitly programmed — they emerge from scale. This is both the magic and the mystery of modern LLMs.

Model Type Best For Limitations Typical Use Case
frontier/large Complex reasoning, nuanced writing, multi-step tasks Slower, more expensive per token Legal review, architecture design, research synthesis
mid-size Good balance of speed, cost, capability May struggle with very complex chains Customer support, coding assistance, summarization
small/fast High throughput, low latency, classification Limited reasoning depth Intent routing, triage, simple Q&A
fine-tuned Domain-specific tasks with consistent format Requires data/training; may lose general capability Medical coding, financial extraction, brand voice
embedding Semantic search, clustering, similarity Not generative — produces vectors, not text RAG retrieval, duplicate detection, recommendation
02

Prompt Engineering

Prompting is how you communicate intent to an AI model. It's part art, part science — and the highest-leverage skill for getting reliable, quality outputs.

Anatomy of a Great Prompt
Prompt Template
# SYSTEM PROMPT
You are an expert data analyst at a logistics company.
Your tone is concise, precise, and uses metric units.
Always structure responses as: Summary → Data → Recommendation.
Never speculate beyond the provided data.

# USER PROMPT
Context: Q3 shipment data shows 12% delay increase in Zone B.

Task: Identify the top 3 likely causes and rank by impact.

Format: Numbered list. Max 3 sentences per item.

Output example:
1. [Cause] — [Evidence] — [Recommended action]
2. ...

Core Techniques

  • Zero-shot Prompting
    Simply ask. Works when task is clear and model has strong priors.
  • Few-shot Examples
    Include 2–5 input/output examples. Dramatically improves format consistency.
  • Chain-of-Thought
    Add "Think step by step" or show reasoning in examples. Unlocks multi-step logic.
  • Role Assignment
    Define who the model is. Primes tone, vocabulary, and domain knowledge activation.
  • Output Constraints
    Specify format, length, structure. Never assume the model knows your preferred format.
  • Negative Instructions
    Explicitly state what NOT to do. "Do not include caveats" outperforms hoping it won't.

Place the most important instruction first and last in your prompt. Models show recency and primacy bias — key constraints buried in the middle are often ignored.

Vague prompts like "write something good about our product" produce mediocre outputs. Specificity is the single biggest lever: audience, format, tone, length, goal, constraints.

Temperature doesn't replace prompt quality. A well-designed prompt at temp 0 usually outperforms a vague prompt at any temperature.

Prompt Patterns Reference
Chain-of-Thought Example
// Prompting the model to reason before answering
const prompt = `
Question: If a train leaves at 9:00 AM traveling 80 mph and another
leaves at 10:30 AM traveling 100 mph from 400 miles away, when do they meet?

Instructions: Work through this step by step, showing your calculations.
Then give a final single-sentence answer.
`;

// Output will contain visible reasoning chain + final answer
// This "scratchpad" approach dramatically improves accuracy
03

Agentic AI & Multi-Step Systems

Agentic AI refers to systems where an LLM autonomously plans, uses tools, and executes multi-step workflows — rather than responding to a single prompt and stopping.

An "agent" is just an LLM in a loop: it perceives state, decides on an action, executes it, observes the result, and repeats — until the task is complete or it determines it can't proceed. — Architectural principle
The Agent Loop
🎯Goal / Task
🤔Plan & Reason
🔧Select Tool
Execute
👁️Observe Result
🔁Loop or Done
Pattern
🤖

ReAct (Reason + Act)

The model alternates between "Thought" (reasoning about what to do next), "Action" (invoking a tool), and "Observation" (processing the result). Traces are fully visible and debuggable.

ReliableTransparent
Pattern
🗓️

Plan-and-Execute

A "planner" model generates a full task breakdown first. An "executor" then works through each step with access to tools. Separates high-level strategy from low-level execution.

StructuredComposable
Pattern
🌐

Multi-Agent Systems

Multiple specialized agents coordinate: an orchestrator delegates to sub-agents (researcher, writer, coder). Each has narrow scope and relevant tools. Scales complexity without overloading a single context.

ScalableSpecialized
Pattern
🔍

Reflection & Self-Critique

After generating output, the model reviews its own work against criteria, then revises. A separate "critic" model can also evaluate outputs. Substantially improves quality on complex tasks.

QualityIterative
Pattern
🌲

Tree of Thought

Explores multiple reasoning paths in parallel (branching like a tree), evaluates each, and pursues the most promising branch. Excellent for problems with many plausible solution paths.

Complex Tasks
Pattern
🔃

Human-in-the-Loop

Checkpoints where the agent pauses and requests human approval before taking consequential actions. Essential for any task with real-world effects (sending emails, writing code to deploy, etc.).

SafeRecommended

Start with minimal autonomy. Build agents that run 1–2 steps before requiring human confirmation. Only expand autonomy after you understand failure modes. The biggest agentic failures come from over-trusting a model's plan early in development.

Simple ReAct Agent Loop (Python Pseudocode)
def run_agent(task: str, tools: dict, max_steps: int = 10):
    messages = [
        {"role": "system", "content": AGENT_SYSTEM_PROMPT},
        {"role": "user",   "content": task}
    ]

    for step in range(max_steps):
        response = llm.call(messages, tool_definitions=tools)

        # Agent chose to use a tool
        if response.tool_call:
            tool_name = response.tool_call.name
            tool_args = response.tool_call.args
            tool_result = tools[tool_name](**tool_args)  # Execute tool

            messages.append(response.as_message())
            messages.append({
                "role": "tool",
                "content": str(tool_result)
            })

        # Agent is done — return final answer
        elif response.final_answer:
            return response.final_answer

    raise AgentTimeoutError("Max steps reached without resolution")
04

Memory Systems & State

LLMs have no built-in persistent memory — they start fresh every conversation. All "memory" must be engineered. Understanding the four memory types is foundational for building coherent AI systems.

Type 1 — In-Context
Working Memory

Everything within the current context window. Fast, zero-latency. Disappears when the session ends. Limited by token budget. This is all the model natively has.

Type 2 — External Store
Episodic Memory

Past interactions stored externally (DB, vector store). Retrieved and injected into context when relevant. Enables persistent user history and long-term continuity.

Type 3 — Semantic Store
Knowledge Base

Structured or embedded knowledge (documents, facts, code). Retrieved via semantic search (RAG). Gives the model access to information beyond its training cutoff.

Type 4 — Procedural
Skill / Fine-tuned Weights

Capabilities baked into model weights via fine-tuning or RLHF. These are "how to do X" patterns the model learned during training. Not easily updated post-deployment.

Context Window Strategies

  1. Summarization Compression

    Periodically summarize older conversation turns into a compact representation. Replace verbose history with the summary to free up token budget.

  2. Sliding Window

    Keep only the last N turns in context. Cheap to implement. Works well when recent context is most relevant (e.g., coding sessions, live support chats).

  3. Hierarchical Context

    Maintain a multi-level structure: full recent turns + summaries of older turns + a persistent "facts extracted" block for key entities and decisions.

  4. Memory Extraction

    After each turn, prompt the model: "Extract any new facts worth remembering." Store structured key-value pairs externally and re-inject on next session start.

Token budget management: Reserve at least 20% of your context window for the model's response. If your prompt uses 90% of the window, outputs will be truncated or degraded.

Memory Injection Pattern
def build_prompt_with_memory(user_msg, user_id):
    # 1. Retrieve relevant memories
    memories = memory_store.search(
        query=user_msg,
        user_id=user_id,
        top_k=5
    )

    # 2. Retrieve recent history
    history = history_store.get_recent(
        user_id=user_id,
        turns=10
    )

    # 3. Inject into system prompt
    system = f"""
User context:
{format_memories(memories)}

Recent conversation:
{format_history(history)}
    """
    return system, user_msg
05

Tool Use & Function Calling

Tool use (aka function calling) allows models to invoke external capabilities — APIs, databases, code executors — and incorporate results into their responses. This is what transforms a chatbot into an agent.

Tool Definition Schema (Anthropic API format)
tools = [
  {
    "name": "search_knowledge_base",
    "description": "Search internal documentation. Use when the user asks
      about company policies, product specs, or procedures.
      Returns top 3 relevant passages with source IDs.",
    "input_schema": {
      "type": "object",
      "properties": {
        "query": {
          "type": "string",
          "description": "Natural language search query"
        },
        "category": {
          "type": "string",
          "enum": ["policies", "products", "procedures"],
          "description": "Filter results by category"
        }
      },
      "required": ["query"]
    }
  }
]

# Key: descriptions are for the MODEL, not developers.
# Be explicit about WHEN to use this tool vs. others.

The description field of each tool is your most powerful lever. Describe when to use the tool, what it returns, and any edge cases. A well-described tool with average implementation outperforms a brilliant tool with a vague description.

Tool Category
🔍

Information Retrieval

Web search, vector DB search, SQL queries, API reads. Give models access to current, domain-specific, or user-specific data. The most common and safest category.

Tool Category
💻

Code Execution

Sandboxed Python/JS interpreters. Lets models verify math, process data, generate charts. Always sandbox — never expose direct system access to a model.

Tool Category
✍️

Write / Mutate Actions

Creating files, sending messages, updating databases, triggering webhooks. Highest risk category — add confirmation steps and human-in-the-loop checkpoints before any irreversible action.

Tool Category
🤝

Agent Handoff

Calling specialized sub-agents or routing to different model configurations. Enables modular multi-agent architectures where each agent handles its domain.

Tool Category
🖥️

Computer Use

Taking screenshots, clicking UI elements, filling forms. Frontier capability — models can interact with arbitrary software as a human would. Requires careful sandboxing and oversight.

Tool Category
📊

Structured Output

Forcing the model to output valid JSON, XML, or typed schemas. Use tools with strict output schemas rather than asking the model to "format as JSON" in free text.

06

Retrieval-Augmented Generation (RAG)

RAG combines retrieval (finding relevant documents) with generation (producing responses). It grounds model outputs in a specific knowledge base, dramatically reducing hallucination on factual tasks.

📄Source Docs
✂️Chunk & Clean
🔢Embed
🗄️Vector Store
🔍Retrieve Top-K
💬Generate

Chunking Strategies

StrategyWhen to Use
Fixed-size (512 tokens)Baseline; works for most homogeneous docs
Semantic (paragraph/section)Docs with natural section breaks
Sliding window w/ overlapWhen key info may span chunk boundaries
Hierarchical (doc → section → chunk)Complex, nested documents
Proposition-basedHigh-precision Q&A requiring single-fact precision

Retrieval Methods

MethodStrength
Dense (vector cosine)Semantic similarity, paraphrase-robust
Sparse (BM25/TF-IDF)Exact keyword match, proper nouns
Hybrid (dense + sparse)Best of both — recommended default
HyDEGenerate hypothetical answer first; retrieve using it
Re-ranking (cross-encoder)Re-score top-N for precision; adds latency

Metadata filtering first. Before semantic search, filter by document type, date range, department, or other metadata. Retrieval over 50 filtered documents beats retrieval over 50,000 unfiltered ones — both in speed and relevance.

Don't skip evaluation. RAG systems fail silently. Set up a golden Q&A test set before launch. Measure both retrieval recall (did we fetch the right docs?) and answer faithfulness (did the model hallucinate beyond the retrieved content?) separately.

07

Safety, Alignment & Guardrails

As AI systems gain agency, safety engineering becomes critical. The goal is ensuring models pursue intended goals and stop when they shouldn't proceed — without requiring constant human supervision.

Safety Layer
🛡️

Input Guardrails

Classify and filter incoming requests before they reach your main model. Use fast, cheap classifiers to detect policy violations, prompt injections, or off-topic requests. Fail loudly and early.

Safety Layer
🔬

Output Guardrails

Validate model outputs before returning to users or passing to downstream systems. Check for hallucinations, policy violations, PII leakage, or malformed structured outputs. Never trust raw model output blindly.

Safety Layer
🎯

Prompt Injection Defense

Malicious content in retrieved documents or user inputs can attempt to override system instructions. Use clear delimiters, privilege separation, and never interpolate untrusted content directly into system prompts.

Safety Layer
📋

Minimal Privilege

Only grant the model access to tools and data it needs for the current task. An agent answering customer questions doesn't need database write access. Scope tool permissions tightly.

Safety Layer

Reversibility Preference

Design agents to prefer reversible over irreversible actions. "Draft email" before "send email." "Stage database change" before "commit." Irreversible actions should always require explicit human confirmation.

Safety Layer
📊

Observability & Logging

Log every agent step, tool call, and decision with full context. You cannot debug what you cannot observe. Structured traces are essential for diagnosing failures in complex agentic systems.

LLM outputs are not trusted by default. Treat model outputs like you would user-generated content: sanitize before rendering in HTML, validate before writing to databases, review before sending to external parties.

08

Production Patterns & Best Practices

Moving from prototype to production requires reliability patterns, cost management, and evaluation frameworks. These are the practices that separate demos from dependable systems.

Reliability

  • Retry with Backoff
    API calls fail. Always implement exponential backoff with jitter. 3 retries with 1s/2s/4s waits handles 99% of transient errors.
  • Fallback Models
    If primary model is unavailable, fall back to secondary. Design model-agnostic prompt structures to make this easier.
  • Timeout Budgets
    Set explicit timeouts per agent step. An agent stuck in a loop should self-terminate, not run until external kill.
  • Idempotency
    Design tool calls to be safe to retry. Track tool call IDs to prevent duplicate side effects on retry.

Cost Management

  • Model Routing
    Route simple tasks to fast/cheap models. Use an intent classifier to select which model tier handles each request.
  • Prompt Caching
    Cache system prompts and large static context blocks. Re-use across calls with same prefix. Can cut costs 80%+ on repeated structures.
  • Response Caching
    Cache identical or semantically similar queries. Use semantic cache keys for fuzzy matching.
  • Batch Processing
    Use async batch APIs for non-latency-sensitive workloads. Typically 50% cheaper than synchronous calls.
Evaluation Framework
Eval Type

Unit Tests (Assertions)

Deterministic checks on model output: does it contain required fields? Is JSON valid? Does it avoid banned phrases? Fast, cheap, run on every change.

AutomatedAlways-on
Eval Type

LLM-as-Judge

A separate (typically larger) model evaluates quality dimensions: accuracy, helpfulness, safety, tone. Scales to thousands of samples without human review. Use structured scoring rubrics.

Scalable
Eval Type

Human Evaluation

Gold standard for subjective quality. Use for calibration of automated evals and for high-stakes decisions. Prefer pairwise comparisons (A vs. B) over absolute scoring — more reliable.

Ground Truth
Eval Type

A/B Testing in Prod

Route a percentage of real traffic to a new model/prompt version. Measure business metrics (task completion, escalation rate, satisfaction) rather than just model metrics.

Real Signal

Start evaluations on day one. Build a small golden test set (50–100 examples) before you build anything else. Every prompt change should run against it. Evals are the unit tests of AI systems — skipping them means shipping blind.

Quick Reference Checklist

Before Going Live

  • Prompt version controlled in git
  • Golden eval set created and passing
  • Input + output guardrails in place
  • All PII fields identified and masked
  • API key scoped to minimum permissions
  • Retry + timeout logic implemented
  • Full request/response logging enabled
  • Rate limit handling tested
  • Costs estimated and budget alerts set
  • Human escalation path defined

For Agentic Systems

  • Max steps / iteration limit set
  • Every tool call logged with args + result
  • Irreversible actions require confirmation
  • Minimal privilege enforced on all tools
  • Prompt injection vectors reviewed
  • Graceful degradation path exists
  • State corruption recovery planned
  • Human-in-the-loop checkpoint defined
  • Failure modes tested with adversarial inputs
  • Rollback procedure documented

A living reference — as the field evolves, so should your practices