Tech · Observability

Datadog
Observability Platform

A comprehensive guide to Datadog's agent, metrics, logs, APM, infrastructure monitoring, alerting, dashboards, synthetics, security, and integrations — built from official documentation.

Agent v7 · Metrics & Logs · APM & Tracing · Monitors & Alerts · Security · 1,000+ Integrations · OpenTelemetry · Kubernetes

01 — Product Overview

What is Datadog?

Datadog is a cloud-scale observability and security platform that unifies metrics, logs, traces, real user monitoring, synthetic testing, and security signals into a single pane of glass across any infrastructure.

📡

Infrastructure Monitoring

Collect 75–100 system metrics every 15–20 seconds from hosts, containers, cloud services, and serverless functions via the lightweight Agent.

🔬

APM & Distributed Tracing

End-to-end distributed tracing with flame graphs, service maps, error tracking, deployment comparison, and Continuous Profiler.

📋

Log Management

Centralized log ingestion, parsing pipelines, real-time Live Tail search, and Flex Logs tiered storage with up to 7-year retention.

🔔

Monitors & Alerting

Threshold, anomaly, forecast, composite, and ML-based monitors with multi-channel alerting via PagerDuty, Slack, OpsGenie, and more.

🛡

Security

Cloud SIEM, Cloud Security Management, App & API Protection, Code Security, and Workload Protection on one unified platform.

🤖

AI & ML Features

Watchdog AI for automated anomaly detection, root cause analysis, LLM Observability, and Issue Correlation across services.

Three pillars of observability: Datadog unifies Metrics (what is happening), Logs (why it happened), and Traces (where in the call chain). With APM enabled, the tracing libraries can inject trace IDs into logs, so a click on any correlated log takes you directly to the distributed trace that produced it.

02 — Platform Architecture

How Datadog Works

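A lightweight Agent deployed on each host collects and buffers telemetry, then forwards it to the Datadog SaaS backend for processing, storage, and analysis. Metrics travel over HTTPS; logs use compressed HTTPS by default in Agent 6.19+/7.19+ and fall back to TLS-encrypted TCP when the HTTPS connectivity test fails.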

YOUR INFRASTRUCTURE

Linux Host: Agent + Checks
Kubernetes: DaemonSet + Cluster Agent
Docker: Container Agent
AWS Lambda: Extension / Forwarder
Windows: Agent Service

↓ collected by ↓

DATADOG AGENT PROCESSES

Collector: runs checks, gathers metrics
Forwarder: buffers + sends over HTTPS
DogStatsD: custom metrics, UDP/UDS :8125
APM Agent: traces from apps :8126
Process Agent: live process info

↓ HTTPS (metrics, logs) / TLS-encrypted TCP (logs fallback) ↓

DATADOG BACKEND (SaaS)

Metrics Store: 15-month retention
Log Management: Standard / Flex Tiers
Trace Storage: 15 days indexed spans
Watchdog AI: ML anomaly detection

↓ visualize / alert / notify ↓

Dashboards · Monitors · APM Traces · Log Explorer · PagerDuty / Slack

Agent Internal Components

Collector — runs all configured checks and gathers metrics on a 15–20 second interval. Passes output to the local Aggregator and Forwarder.

Forwarder — sends payloads to Datadog over HTTPS. During network outages it buffers payloads in memory and retries, which limits data loss; once the memory limit is reached, the oldest payloads are discarded first.

APM Agent — separate optional process collecting distributed traces. Enabled by default. Listens on port 8126.

Process Agent — collects live process and container info. Requires explicit enablement for full process monitoring.

DogStatsD — Go implementation of StatsD. Accepts custom metrics over UDP (port 8125) or Unix socket, aggregates and forwards to backend.
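The Forwarder's buffering behavior can be pictured with a small bounded queue. This is only a sketch of the "discard oldest at the limit" idea, not the Agent's actual implementation; the class name BoundedRetryBuffer and the payload-count limit are assumptions made for illustration (the real Agent limits by memory, not by count).

```python
from collections import deque


class BoundedRetryBuffer:
    """Hypothetical in-memory buffer that drops the oldest payload when full."""

    def __init__(self, max_payloads):
        # deque(maxlen=...) evicts from the left when a new item is appended
        # to a full buffer, which gives "discard oldest first" behavior.
        self._queue = deque(maxlen=max_payloads)
        self.dropped = 0

    def add(self, payload):
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1  # the append below will evict the oldest payload
        self._queue.append(payload)

    def flush(self, send):
        """Send buffered payloads in order; stop at the first failure."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break  # network still down, keep the remainder buffered
            self._queue.popleft()
            sent += 1
        return sent


# Simulate an outage: five payloads arrive but only three fit in the buffer.
buffer = BoundedRetryBuffer(max_payloads=3)
for n in range(5):
    buffer.add(f"payload-{n}")

# The network recovers and every send succeeds.
delivered = []
sent_count = buffer.flush(lambda payload: delivered.append(payload) or True)
```

After the simulated outage, only the three newest payloads are delivered and buffer.dropped records the two that were evicted.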

Data Retention

Data Type	Default Retention
Metrics	15 months
Custom span-based metrics	15 months
Indexed spans / traces	15 days
Ingested spans (in-flight)	15 minutes
Standard Tier logs	3–30 days (configurable)
Flex Logs (frozen tier)	Up to 7 years
RUM sessions	30 days

03 — The Datadog Agent

Agent Installation & Configuration

The Datadog Agent is open-source software written in Go that runs on every monitored host. Agent 7 is the latest major version. Resource footprint: ~0.08% CPU avg, ~95 MB RAM, 880 MB–1.3 GB disk on a c5.xlarge.

Installation

BASH — Linux one-liner
# Install Agent 7 on Linux
DD_API_KEY="<YOUR_API_KEY>" DD_SITE="datadoghq.com" \
  bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"

# Service management
sudo systemctl start   datadog-agent
sudo systemctl stop    datadog-agent
sudo datadog-agent     status

DOCKER
docker run -d --name dd-agent \
  --cgroupns host \
  --pid host \
  -e DD_API_KEY="<API_KEY>" \
  -e DD_SITE="datadoghq.com" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:7

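KUBERNETES — Datadog Operator (recommended)
helm repo add datadog https://helm.datadoghq.com
helm repo update
helm install datadog-operator datadog/datadog-operator

# Secret that holds the API key referenced by the custom resource
kubectl create secret generic datadog-secret --from-literal api-key=<YOUR_API_KEY>

YAML — DatadogAgent custom resource (apply with kubectl apply -f)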
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    clusterName: my-cluster
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
  features:
    apm:
      enabled: true
    logCollection:
      enabled: true
      containerCollectAll: true
    liveProcessCollection:
      enabled: true

Core Configuration — datadog.yaml

YAML — /etc/datadog-agent/datadog.yaml
api_key: <YOUR_API_KEY>
site:    datadoghq.com         # or datadoghq.eu, us3.datadoghq.com
hostname: my-host.example.com

## Tags applied to ALL telemetry from this host
tags:
  - env:production
  - team:platform
  - region:us-east-1

## Enable features
logs_enabled:       true
apm_config:
  enabled:          true
process_config:
  process_collection:
    enabled:        true

## DogStatsD
dogstatsd_port:     8125
dogstatsd_socket:   /var/run/datadog/dsd.socket

## Optional proxy
proxy:
  https:            http://proxy.corp:3128

Deployment Options

Environment	Method
Linux	install_script_agent7.sh
Windows	MSI installer / Chocolatey
macOS	Homebrew / DMG
Docker	gcr.io/datadoghq/agent:7
Kubernetes	Datadog Operator (recommended)
Kubernetes (alt)	datadog/datadog Helm chart
AWS ECS	Daemon service / Fargate sidecar
AWS Lambda	Lambda Extension / Forwarder
IoT	IoT Agent (lightweight binary)

Integration Check Config

YAML — conf.d/nginx.d/conf.yaml
init_config:

instances:
  - nginx_status_url: http://localhost/nginx_status
    tags:
      - service:nginx
      - env:production

K8s — Autodiscovery annotations
annotations:
  ad.datadoghq.com/nginx.check_names:  '["nginx"]'
  ad.datadoghq.com/nginx.init_configs: '[{}]'
  ad.datadoghq.com/nginx.instances:
    '[{"nginx_status_url":"http://%%host%%/nginx_status"}]'
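To make the annotation format concrete, the sketch below builds the same three Autodiscovery annotations from Python data and then substitutes the %%host%% template variable. The helper names build_annotations and resolve_template are hypothetical, the IP address is a made-up example, and the substitution is a simplified stand-in for what the Agent does when it discovers a container.

```python
import json


def build_annotations(container, check, instances):
    """Return Autodiscovery annotations for one container as a dict."""
    prefix = f"ad.datadoghq.com/{container}"
    return {
        f"{prefix}.check_names": json.dumps([check]),
        f"{prefix}.init_configs": json.dumps([{}]),
        f"{prefix}.instances": json.dumps(instances),
    }


def resolve_template(value, variables):
    """Replace %%name%% placeholders with concrete values (simplified)."""
    for name, replacement in variables.items():
        value = value.replace(f"%%{name}%%", replacement)
    return value


annotations = build_annotations(
    container="nginx",
    check="nginx",
    instances=[{"nginx_status_url": "http://%%host%%/nginx_status"}],
)

# Example pod IP, assumed for illustration only.
resolved = resolve_template(
    annotations["ad.datadoghq.com/nginx.instances"], {"host": "10.0.0.12"}
)
```

Each annotation value is a JSON string, which is why the YAML above wraps the arrays in single quotes.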

Fleet Automation: Remotely configure, upgrade, and manage all Agents across all environments from the Datadog UI — no SSH required. Supports automatic rollback if an Agent fails to restart after upgrade.

04 — Metrics

Metrics Collection & Custom Metrics

Metrics are time-series data points identified by a name and tags. They can be collected by the Agent, submitted via API/DogStatsD, or imported from cloud provider integrations.

Metric Types

Type	Description	Use Case
COUNT	Events in a flush interval	http.requests
RATE	Count divided by time interval	requests.per_second
GAUGE	Instantaneous value at flush	cpu.usage, mem.free
HISTOGRAM	Statistical distribution	response.time
DISTRIBUTION	Global percentile calculations	latency.p99
SET	Count of unique elements	unique_users

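HISTOGRAM outputs: When a HISTOGRAM metric is submitted, the Agent aggregates the samples in each flush interval and, by default, sends .avg, .count, .median, .95percentile, and .max as separate metric streams. Additional aggregations such as .min and .sum can be enabled with the histogram_aggregates setting in datadog.yaml.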
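As a rough illustration of what one flush interval produces, the sketch below reduces a list of samples to the aggregations discussed above. The function name histogram_aggregates is hypothetical, and the nearest-rank percentile used here is an assumption; the Agent's exact percentile method may differ.

```python
import math
from statistics import mean, median


def histogram_aggregates(samples):
    """Summarize the samples seen in one flush interval."""
    ordered = sorted(samples)
    # Nearest-rank 95th percentile: the smallest sample with at least
    # 95% of all samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "avg": mean(ordered),
        "count": len(ordered),
        "median": median(ordered),
        "95percentile": ordered[rank - 1],
        "max": ordered[-1],
        "min": ordered[0],
    }


# Ten hypothetical response times (ms) with one slow outlier.
aggregates = histogram_aggregates([12, 15, 11, 240, 14, 13, 16, 12, 15, 14])
```

Note how a single outlier pulls the average far away from the median, which is the usual reason to look at percentiles rather than .avg alone.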

Submitting Custom Metrics via DogStatsD

PYTHON
from datadog import initialize, statsd
initialize(statsd_host='localhost', statsd_port=8125)

# Gauge
statsd.gauge('app.queue.depth', 42,
  tags=['env:prod', 'service:worker'])

# Increment counter
statsd.increment('app.page.views',
  tags=['page:home'])

# Histogram (timing)
statsd.histogram('db.query.time', 0.042,
  tags=['query:get_user'])

# Distribution (global percentiles)
statsd.distribution('api.response.time', 125.3,
  tags=['endpoint:/checkout'])
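Under the hood, each of the client calls above becomes a short text datagram of the form name:value|type|@sample_rate|#tags. The sketch below formats and optionally sends such a datagram with only the standard library; the helper names format_datagram and send_datagram are hypothetical, and the official client remains the better choice in real services.

```python
import socket

# Type codes used in the DogStatsD datagram format.
TYPE_CODES = {
    "count": "c",
    "gauge": "g",
    "timer": "ms",
    "histogram": "h",
    "set": "s",
    "distribution": "d",
}


def format_datagram(name, value, metric_type, tags=None, sample_rate=None):
    """Build one DogStatsD datagram as a string."""
    parts = [f"{name}:{value}", TYPE_CODES[metric_type]]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send_datagram(datagram, host="localhost", port=8125):
    """Fire-and-forget UDP send to a local Agent (not called in this sketch)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


gauge = format_datagram(
    "app.queue.depth", 42, "gauge", tags=["env:prod", "service:worker"]
)
# gauge == "app.queue.depth:42|g|#env:prod,service:worker"
```

Because the transport is UDP, a send succeeds even when no Agent is listening, so delivery problems show up as missing metrics rather than application errors.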

APM Metric Namespaces

Namespace	Captures
trace.<span>.hits	Request count per service
trace.<span>.errors	Error count per service
trace.<span>.apdex	Apdex score (HTTP/web)
runtime.*	Language runtime metrics

05 — Log Management

Log Collection, Pipelines & Storage

Datadog Log Management centralizes logs from all sources with real-time search, Grok parsing pipelines, faceted exploration, alerting, and Flex Logs for long-term cost-effective retention.

Collection (File, Docker, K8s, AWS, API)
→ Processing (Grok parsing pipelines)
→ Enrichment (Tags, attributes, lookups)
→ Indexing (Retention filters + Flex)
→ Log Explorer (Search, Live Tail, facets)
→ Monitors (Threshold & anomaly alerts)

Enabling Log Collection

YAML — datadog.yaml
logs_enabled: true

YAML — conf.d/python.d/conf.yaml
logs:
  - type:    file
    path:    /var/log/myapp/*.log
    source:  python
    service: my-service
    tags:
      - env:production

DOCKER — label-based collection
labels:
  com.datadoghq.ad.logs:
    '[{"source":"nginx","service":"web"}]'

Log Limits

Limit	Value
Max log size (HTTPS)	1 MB
Recommended per log	< 25 KB
Agent auto-split threshold	900 KB
Max tags per log event	100
Max JSON attributes	256
Max attribute key length	50 chars
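A pre-flight check against these limits can catch oversized events before they are shipped. The sketch below is one possible reading of the table: the function name check_log_event is hypothetical, 1 MB is taken as 10**6 bytes, and counting every nested key as one attribute is an assumption about how the limit is applied.

```python
import json

MAX_LOG_BYTES = 1_000_000   # "Max log size" row, read as 10**6 bytes (assumption)
RECOMMENDED_BYTES = 25_000  # "Recommended per log" row
MAX_TAGS = 100
MAX_ATTRIBUTES = 256
MAX_KEY_LENGTH = 50


def walk_keys(obj):
    """Yield every key in a nested dict, parent keys included."""
    for key, value in obj.items():
        yield key
        if isinstance(value, dict):
            yield from walk_keys(value)


def check_log_event(event, tags):
    """Return a list of limit violations for one JSON log event."""
    problems = []
    size = len(json.dumps(event).encode("utf-8"))
    keys = list(walk_keys(event))
    if size > MAX_LOG_BYTES:
        problems.append("exceeds max log size")
    elif size > RECOMMENDED_BYTES:
        problems.append("larger than recommended size")
    if len(tags) > MAX_TAGS:
        problems.append("too many tags")
    if len(keys) > MAX_ATTRIBUTES:
        problems.append("too many attributes")
    if any(len(key) > MAX_KEY_LENGTH for key in keys):
        problems.append("attribute key too long")
    return problems


ok = check_log_event({"message": "ok", "http": {"status_code": 200}}, ["env:prod"])
```

An empty list means the event is within every limit checked here.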

Storage Tiers

🔥 Standard Tier

Fully indexed logs for real-time search, monitors, dashboards, Live Tail. Configurable 3–30 day retention. Full Log Explorer capabilities.

❄️ Flex Logs

Cost-effective tier for lower-query-frequency logs. In-place searchability without rehydration. Flex Frozen sub-tier stores up to 7 years for compliance and forensic investigation.

🗃 Archive Search

Query logs archived directly in cloud storage (S3, GCS, Azure Blob) or Flex Frozen — without exporting or rehydrating. Ideal for audits and long-range analytics.

Trace-log correlation: APM auto-injects trace IDs into logs. Clicking a log entry with a trace ID jumps immediately to the correlated trace — no manual query building.

Grok Parser Example

GROK — Apache common access log rule (enter as a single line)
access_log %{ip:network.client.ip} %{notSpace:http.ident} %{notSpace:http.auth} \[%{date("dd/MMM/yyyy:HH:mm:ss Z"):date}\] "%{word:http.method} %{notSpace:http.url} %{notSpace:http.version}" %{integer:http.status_code} %{integer:network.bytes_written}
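For readers more familiar with regular expressions, the Grok rule above corresponds roughly to the Python sketch below, which extracts the same attribute names from one access log line. It is an approximation for illustration: the function name parse_access_log is hypothetical, the client IP is matched as any non-space token rather than validated, and the sample line is an invented example.

```python
import re
from datetime import datetime

ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<url>\S+) (?P<version>\S+)" '
    r'(?P<status_code>\d+) (?P<bytes_written>\d+)'
)


def parse_access_log(line):
    """Return a dict of structured attributes, or None if the line does not match."""
    match = ACCESS_LOG.fullmatch(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "network.client.ip": fields["client_ip"],
        "http.ident": fields["ident"],
        "http.auth": fields["auth"],
        # Same layout as the Grok date matcher: dd/MMM/yyyy:HH:mm:ss Z
        "date": datetime.strptime(fields["date"], "%d/%b/%Y:%H:%M:%S %z"),
        "http.method": fields["method"],
        "http.url": fields["url"],
        "http.version": fields["version"],
        "http.status_code": int(fields["status_code"]),
        "network.bytes_written": int(fields["bytes_written"]),
    }


sample = '192.0.2.10 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
parsed = parse_access_log(sample)
```

As with the Grok rule, the status code and byte count come out as integers, so they can be used as measures rather than plain strings.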

06 — APM & Distributed Tracing

Application Performance Monitoring

Datadog APM provides end-to-end distributed tracing, flame graphs, service maps, deployment tracking, and Continuous Profiler — with deep correlation to logs, metrics, and RUM.

Instrumentation Methods

⚡

Single Step Instrumentation

Installs Agent + instruments app in one step — no code changes. The simplest starting point.

📚

Tracing Libraries

Language-specific libraries for Python, Java, Go, Ruby, Node.js, .NET, PHP, C++, Rust.

🔭

OpenTelemetry

Send OTel metrics, traces, and logs into Datadog via the Collector with Datadog Exporter.

🔧

Dynamic Instrumentation

Add instrumentation to live running services via the Datadog UI — no code deploys or restarts required.

Key APM Features

🗺

Service Map

Auto-generated dependency map of all services with real-time error rates and latency per connection.

🔥

Flame Graphs

Full call tree of any trace with time-spent visualization. Identify slowest code paths instantly.

❌

Error Tracking

Intelligent error grouping across services. Track new vs. regressing issues by deployed version.

🚀

Deployment Tracking

Compare error rate, latency, and throughput before/during/after each deployment. Auto-detect faulty deploys via Watchdog.

📊

Continuous Profiler

Always-on low-overhead code profiling in production. See exactly which methods consume CPU, memory, and I/O.

🔗

Trace-Log Correlation

Trace IDs injected into logs. View logs side-by-side with the trace that generated them.

Python APM

BASH
pip install ddtrace

# Auto-instrument at startup (recommended)
DD_SERVICE="my-api" DD_ENV="prod" DD_VERSION="1.2.0" \
  ddtrace-run python app.py

PYTHON — manual span creation
from ddtrace import tracer

with tracer.trace("db.query", resource="SELECT users") as span:
    span.set_tag("db.type", "postgres")
    result = db_query()  # placeholder for your own database call

Java APM

JAVA — JVM flag
java -javaagent:/path/to/dd-java-agent.jar \
  -Ddd.service=my-app \
  -Ddd.env=production \
  -Ddd.version=1.0.0 \
  -jar app.jar

OpenTelemetry Collector

YAML — OTel Collector Datadog Exporter
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  datadog:
    api:
      key:  ${env:DD_API_KEY}
      site: datadoghq.com
    traces:
      compute_stats_by_span_kind: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog]
    metrics:
      receivers: [otlp]
      exporters: [datadog]

Unified Service Tagging: Apply env, service, and version consistently across all telemetry types from a service to enable seamless pivoting between metrics, logs, and traces.

07 — Infrastructure Monitoring

Infrastructure & Container Monitoring

Monitor hosts, containers, Kubernetes clusters, cloud services, and serverless functions from a unified Infrastructure List with real-time metrics, health status, and live process monitoring.

Infrastructure Views

Infrastructure List — every host with key metrics and tag filtering
Host Map — hexagonal heatmap of all hosts by any metric
Containers page — resource metrics and faceted search across containers
Container Images — every image in your env + vulnerability data
Orchestrator Explorer — monitor pods, deployments, namespaces
Control Plane Monitoring — API server, scheduler, controller manager, etcd
Live Processes — real-time process list with CPU, memory, I/O
Network Performance Monitoring — eBPF-based traffic flow visibility

Key System Metrics

Category	Example Metrics
CPU	system.cpu.user, system.load.1
Memory	system.mem.used, system.swap.used
Disk I/O	system.io.rkb_s, system.disk.used
Network	system.net.bytes_rcvd, bytes_sent

Cloud Integrations

AWS: EC2, ECS, EKS, Lambda, RDS, S3, CloudWatch…

GCP: GCE, GKE, Cloud SQL, Cloud Run, Pub/Sub…

Azure: VMs, AKS, Functions, Event Hubs, Blob…

Kubernetes Cluster Agent

The Cluster Agent efficiently gathers monitoring data from across an orchestrated cluster. It distributes check configurations to node Agents and ensures only one instance of each check runs per workload — preventing duplicate data collection across replicas.

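The Cluster Agent holds the check configurations and dispatches them to node Agents, which query it every 10 seconds to retrieve the configurations they should run. If a node Agent stops reporting, the Cluster Agent removes it from the active pool and redistributes its configurations to the remaining Agents.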
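The dispatch-and-redistribute behavior can be sketched as a simple least-loaded assignment. This is not the Cluster Agent's real scheduling algorithm; the function names dispatch and remove_node and the balancing rule are assumptions chosen to show the two properties described above: each check runs on exactly one node, and a failed node's checks move elsewhere.

```python
def least_loaded(assignment):
    """Pick the node currently running the fewest checks (ties by name)."""
    return min(sorted(assignment), key=lambda node: len(assignment[node]))


def dispatch(checks, nodes):
    """Assign each check to exactly one node Agent."""
    assignment = {node: [] for node in nodes}
    for check in sorted(checks):
        assignment[least_loaded(assignment)].append(check)
    return assignment


def remove_node(assignment, failed_node):
    """Drop a node that stopped reporting and hand its checks to the rest."""
    orphaned = assignment.pop(failed_node)
    for check in orphaned:
        assignment[least_loaded(assignment)].append(check)
    return assignment


plan = dispatch(["mysql", "redis", "kafka", "nginx"], ["node-a", "node-b"])
plan = remove_node(plan, "node-a")  # node-a stops reporting
```

After node-a is removed, node-b carries all four checks, and no check is ever scheduled twice.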

Serverless

For AWS Lambda, Datadog collects metrics, traces, and logs via the Lambda Extension (preferred, runs alongside the function inside the Lambda execution environment) or the Lambda Forwarder (CloudWatch-based). Supports enhanced Lambda metrics, cold start detection, and X-Ray integration.

08 — Monitors & Alerting

Monitors, Alerts & SLOs

Monitors evaluate metric, log, or trace queries against defined conditions and trigger alerts with notifications to PagerDuty, Slack, email, OpsGenie, and more. Evaluation frequency defaults to 1 minute.

Monitor Types

📊

Metric Monitor

Alert when a metric threshold is crossed over a rolling window. Simple or multi-alert modes grouped by any tag.

📋

Log Monitor

Alert on indexed log count, attribute unique count, or measure. Supports group-by facets. Max 2-day rolling window.

🔬

APM Monitor

Monitor service APM metrics (hits, errors, latency) or alert on Trace Analytics Indexed Span patterns.

🤖

Anomaly Monitor

ML-based detection learns seasonal patterns. Alerts on statistically unexpected deviations without manual thresholds.

🔮

Forecast Monitor

Predicts when a metric will breach a threshold. Ideal for disk capacity and resource planning.

🌐

Synthetic Monitor

Alert when a Synthetic API test or browser test fails or exceeds latency thresholds.

🔧

Service Check

Alert based on OK / WARNING / CRITICAL status submitted by Agent integration checks.

🔗

Composite Monitor

Combine monitors with boolean logic (AND, OR, NOT). Alert only when multiple conditions are simultaneously true.

💾

Database Monitoring

Alert on slow queries, connection pool saturation, replication lag for PostgreSQL, MySQL, SQL Server, Oracle.

Monitor — Terraform

TERRAFORM
resource "datadog_monitor" "high_cpu" {
  name    = "High CPU Usage"
  type    = "metric alert"
  message = "CPU > 90% on {{host.name}} @pagerduty"

  query = "avg(last_5m):avg:system.cpu.user{env:production} by {host} > 90"

  monitor_thresholds {
    critical = 90
    warning  = 75
  }
  notify_no_data    = true
  no_data_timeframe = 20
  tags = ["env:production", "team:platform"]
}
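The query in the Terraform resource above reads as "average the last 5 minutes of points per host, then compare against the thresholds". A minimal sketch of that evaluation follows; the function name evaluate_monitor, the state labels, and the sample values are illustrative assumptions rather than Datadog's internal logic.

```python
from statistics import mean


def evaluate_monitor(points, critical=90.0, warning=75.0):
    """Return the state for one evaluation window of metric points."""
    if not points:
        return "NO DATA"
    window_avg = mean(points)
    if window_avg > critical:
        return "ALERT"
    if window_avg > warning:
        return "WARN"
    return "OK"


# "by {host}" means each host is evaluated separately (multi-alert).
last_5m_by_host = {
    "web-1": [95.0, 92.5, 91.0],
    "web-2": [80.0, 78.0, 76.5],
    "web-3": [12.0, 18.5, 15.0],
    "web-4": [],
}
states = {host: evaluate_monitor(points) for host, points in last_5m_by_host.items()}
```

With notify_no_data enabled, the empty window for web-4 would itself produce a notification once the no-data timeframe has elapsed.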

Notification Channels

@pagerduty

@slack-channel

@email

@opsgenie

@victorops

@webhook

@teams

@jira

Monitor Configuration Reference

Option	Description
evaluation_window	Time range for query (last_5m, last_1h)
evaluation_frequency	How often query runs (default 1 min)
critical	Value triggering ALERT state
warning	Value triggering WARNING state
notify_no_data	Alert if no data is received
renotify_interval	Re-alert on sustained state (minutes)
require_full_window	Only evaluate with complete data window
multi_alert	Separate alert per dimension (e.g., per host)

Service Level Objectives (SLOs)

Three SLO types:

Type	Based On
Metric-based	Good events / total events ratio
Monitor-based	Uptime % derived from monitor state
Time Slice	% of time windows metric was within threshold
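The arithmetic behind the metric-based and Time Slice types, plus the error budget that follows from a target, can be sketched in a few lines. The function names and the sample numbers are hypothetical, and treating a slice as good when its value is at or below the threshold is an assumption suited to latency-style metrics.

```python
def metric_based_sli(good_events, total_events):
    """Percentage of events that were good."""
    return 100.0 * good_events / total_events


def time_slice_sli(slice_values, threshold):
    """Percentage of time slices whose value stayed at or below the threshold."""
    good = sum(1 for value in slice_values if value <= threshold)
    return 100.0 * good / len(slice_values)


def error_budget_remaining(sli_percent, target_percent):
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed = 100.0 - target_percent
    spent = 100.0 - sli_percent
    return (allowed - spent) / allowed


sli = metric_based_sli(good_events=999_500, total_events=1_000_000)
slices = time_slice_sli([120, 130, 500, 110], threshold=300)
budget = error_budget_remaining(sli, target_percent=99.9)
```

In this example 500 failed requests out of a million against a 99.9% target use about half of the error budget.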

09 — Dashboards & Visualization

Dashboards

Dashboards provide real-time insight into system health and business KPIs. Build from any combination of metrics, logs, traces, RUM, and events with template variables for dynamic scoping.

Dashboard Types

Type	Use Case
Timeboard	All widgets share the same time range. Best for metric correlation during investigations.
Screenboard	Free-form layout with independent time ranges per widget. Best for NOC status displays.
Notebook	Markdown + live graphs. Best for postmortems, runbooks, and incident investigations.

Widget Types

Timeseries

Query Value

Top List

Table

Distribution / Heatmap

Pie Chart

Scatter Plot

Geo Map

SLO Widget

Service Map

Log Stream

Alert Graph

Change

Funnel

Wildcard (Vega-Lite)

Template Variables — Terraform

TERRAFORM
resource "datadog_dashboard" "service_health" {
  title       = "Service Health"
  layout_type = "ordered"

  template_variable {
    name    = "env"
    prefix  = "env"
    default = "production"
  }

  widget {
    timeseries_definition {
      request {
        q = "avg:trace.web.request.duration{$env} by {service}"
      }
      title = "Request Latency by Service"
    }
  }
}

Datadog Sheets: Spreadsheet-style interface for analyzing telemetry — perform lookups, build pivot tables, create calculated columns, join datasets. Results can be added to dashboards or shared with colleagues.

10 — Synthetic Monitoring & RUM

Synthetics & Real User Monitoring

Synthetics proactively tests endpoints and journeys from Datadog-managed global locations. RUM captures real user interactions and performance from actual browsers and mobile apps.

Synthetic Test Types

Type	Description
API Test	HTTP, gRPC, WebSocket, TCP, SSL, DNS checks. Assert on status codes, body, headers, latency.
Multistep API	Chain multiple API calls with variables passed between steps. Test full auth + action flows.
Browser Test	Headless Chrome tests that record and replay user journeys. Detect visual regressions and broken UI.
Mobile Test	Native iOS and Android app testing with real device simulation.

Continuous Testing (CI/CD)

BASH — datadog-ci
npm install -g @datadog/datadog-ci

# Run Synthetic tests in CI pipeline
datadog-ci synthetics run-tests \
  --public-id "abc-123-xyz" \
  --apiKey "$DD_API_KEY" \
  --appKey "$DD_APP_KEY" \
  --failOnCriticalErrors

Real User Monitoring (RUM)

Session Replay — pixel-perfect video-like replay of real user sessions
Core Web Vitals — LCP, INP (the successor to FID), and CLS tracking with custom user timings
Error Tracking — frontend JS errors grouped and prioritized by user impact
RUM-APM Correlation — link frontend sessions to backend distributed traces
Funnel Analysis — track conversion rates through multi-step user flows
RUM Recommendations — AI-powered performance improvement suggestions (Preview)
Mobile RUM — iOS and Android performance monitoring and crash tracking

Feature Flags & A/B Testing

Datadog Feature Flags integrates with your existing feature flag provider to track flag evaluations alongside RUM data. Correlate feature flag rollouts directly with performance regressions and error spikes in the same view.

11 — Security

Security Products

Datadog unifies observability and security on one platform — eliminating the context-switching between tools that slows down incident response when a performance issue has a security dimension.

🛡

Cloud SIEM

Detect, investigate, and respond to security threats across cloud and on-premises systems. Correlates logs, metrics, and network data to surface high-fidelity signals.

☁️

Cloud Security Management

Continuously audits cloud configurations, assesses identity risks (CIEM), and detects runtime threats across AWS, GCP, and Azure.

🔒

App & API Protection

Detects and blocks threats targeting production applications and APIs in real time, with APM trace context for each attack signal.

💻

Code Security

Detects and fixes vulnerabilities in first-party code, open-source dependencies (SCA), and infrastructure-as-code from dev through runtime (IAST).

⚙️

Workload Protection

Uses eBPF to monitor file, network, and process activity at the kernel level. Detects privilege escalation, cryptomining, and unusual process behavior.

🔍

Audit Trail

Immutable audit log of all user and configuration changes across the Datadog platform — who changed what monitor, API key, or Agent config and when.

12 — Integrations

Integrations Ecosystem

Datadog ships 1,000+ vendor-backed integrations — each providing ready-made Agent checks, dashboards, and monitors. Connect to cloud providers, databases, message queues, CI/CD tools, and observability standards.

Cloud Platforms

AWS

Google Cloud

Azure

Alibaba Cloud

Databases & Caches

PostgreSQL

MySQL

MongoDB

Redis

Cassandra

Elasticsearch

Oracle

SQL Server

CockroachDB

Message Queues & Streaming

Apache Kafka

RabbitMQ

Amazon SQS / Kinesis

Google Pub/Sub

Web Servers & Proxies

Nginx

Apache

HAProxy

Envoy

Istio

Traefik

DevOps & CI/CD

GitHub Actions

GitLab

Jenkins

CircleCI

ArgoCD

Terraform

Ansible

Chef

Puppet

Alerting & Incident Management

PagerDuty

OpsGenie

VictorOps

Slack

Microsoft Teams

Jira

ServiceNow

Webhooks

Observability Standards

OpenTelemetry (OTLP)

Prometheus

StatsD

JMX / JVM metrics

Datadog Internal Developer Portal (IDP): Software Catalog + Self-Service Actions + Scorecards. Visualize service hierarchies, enable self-service infrastructure provisioning, and evaluate production-readiness before release.

13 — Tagging Strategy

Tags & Unified Service Tagging

Tags are key:value metadata attached to every metric, log, trace, and event. A consistent tagging strategy is the foundation of effective filtering, alerting, and root-cause analysis in Datadog.

Unified Service Tagging (Required)

Tag	Purpose	Example
env	Deployment environment	production
service	Service / application name	checkout-api
version	Deployed code version	1.4.2

ENV VARS — Kubernetes pod spec
env:
  - name:  DD_ENV
    value: production
  - name:  DD_SERVICE
    value: checkout-api
  - name:  DD_VERSION
    value: 1.4.2
  - name:  DD_AGENT_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP

Tagging Best Practices

Use lowercase key:value format consistently everywhere
Always tag with env: (production, staging, dev)
Always tag with service: for service-level views
Always tag with version: for deployment tracking
Use team: for ownership routing in monitor notifications
Use region: and availability-zone: for geographic scoping
Avoid high-cardinality values on metrics (user_id, request_id) — use traces for those
Limit custom metric tag cardinality to control monthly custom metric costs
Apply tags at the Agent level for host-wide application
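The cardinality advice is easier to appreciate with numbers: every distinct combination of tag values on a metric is its own time series. The sketch below estimates the worst case and counts what was actually submitted; the function names and the tag values are made up for illustration, and billing details are not modeled.

```python
def worst_case_series(tag_values):
    """Upper bound on time series: every combination of tag values occurs."""
    total = 1
    for values in tag_values.values():
        total *= len(values)
    return total


def observed_series(points):
    """Count distinct (metric name, tag set) combinations actually submitted."""
    return len({(name, tuple(sorted(tags))) for name, tags in points})


bounded = worst_case_series({
    "env": ["prod", "staging"],
    "endpoint": ["/checkout", "/cart", "/login"],
})

# Adding a per-user tag multiplies the series count by the number of users.
unbounded = worst_case_series({
    "env": ["prod", "staging"],
    "endpoint": ["/checkout", "/cart", "/login"],
    "user_id": range(10_000),
})

seen = observed_series([
    ("api.response.time", ["env:prod", "endpoint:/checkout"]),
    ("api.response.time", ["endpoint:/checkout", "env:prod"]),  # same series
    ("api.response.time", ["env:staging", "endpoint:/checkout"]),
])
```

Two environments and three endpoints give at most 6 series; the same metric with a user_id tag could reach 60,000.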

Tag Sources (All Applied Automatically)

Source	Example Tags
Agent config (datadog.yaml)	env:prod, team:platform
Cloud provider metadata	region:us-east-1, instance-type:c5.xlarge
Container labels / K8s labels	app:frontend, kube_namespace:default
Integration check config	db:postgres-prod
DogStatsD metric submission	endpoint:/checkout

14 — API, SDKs & Infrastructure as Code

Datadog API & IaC

Datadog exposes a comprehensive REST API for programmatic access to all platform resources. Official SDKs, Terraform provider, and datadog-ci CLI enable full Datadog-as-Code workflows.

REST API — Key Endpoints

Endpoint	Action
POST /api/v1/series	Submit custom metrics
POST /api/v2/logs	Send log events
GET /api/v1/monitor	List all monitors
POST /api/v1/monitor	Create a monitor
GET /api/v1/dashboard	List all dashboards
POST /api/v1/events	Post to Events Stream
GET /api/v1/query	Metrics query over time range
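To show what a raw call to one of these endpoints looks like, the sketch below prepares a POST /api/v1/series request with only the standard library. The function name build_series_request is hypothetical and the request is built but deliberately not sent; in practice the official client shown next is usually the more convenient route.

```python
import json
import time
import urllib.request


def build_series_request(metric, value, tags, api_key, site="datadoghq.com"):
    """Prepare (but do not send) a gauge submission to the v1 series endpoint."""
    payload = {
        "series": [
            {
                "metric": metric,
                "type": "gauge",
                "points": [[int(time.time()), value]],  # [unix_timestamp, value]
                "tags": tags,
            }
        ]
    }
    return urllib.request.Request(
        url=f"https://api.{site}/api/v1/series",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )


request = build_series_request(
    "app.queue.depth", 42, ["env:prod"], api_key="<DD_API_KEY>"
)
# urllib.request.urlopen(request) would send it; omitted so the sketch
# has no side effects and needs no real API key.
```

Metric submission needs only the API key; endpoints that read or modify configuration, such as monitors and dashboards, also require an application key.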

PYTHON — API client
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi

config = Configuration()
config.api_key["apiKeyAuth"] = "<DD_API_KEY>"
config.api_key["appKeyAuth"] = "<DD_APP_KEY>"

with ApiClient(config) as api_client:
    api = MonitorsApi(api_client)
    print(api.list_monitors())

Terraform Provider

TERRAFORM — Provider setup
terraform {
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0"
    }
  }
}
# Auth via DD_API_KEY + DD_APP_KEY env vars
provider "datadog" {
  api_url = "https://api.datadoghq.com/"
}

Terraform Resources

datadog_monitor

datadog_dashboard

datadog_service_level_objective

datadog_synthetics_test

datadog_logs_index

datadog_logs_pipeline

datadog_metric_tag_configuration

datadog_user / datadog_role

datadog_security_monitoring_rule

datadog_integration_aws

datadog_downtime

datadog-ci CLI

BASH
npm install -g @datadog/datadog-ci

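# Upload source maps for RUM error tracking
# (--minified-path-prefix is the URL prefix your minified files are served from)
datadog-ci sourcemaps upload ./dist \
  --service my-app --release-version 1.4.2 \
  --minified-path-prefix "<MINIFIED_PATH_PREFIX>"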

# Report CI test results
datadog-ci junit upload --service my-app ./test-results.xml

15 — Glossary

Key Terminology

Core terms used across the Datadog platform and documentation.

Agent

Open-source Go software that runs on monitored hosts to collect metrics, logs, traces, and events and forward them to Datadog.

DogStatsD

StatsD-compatible daemon built into the Agent for receiving custom application metrics over UDP (port 8125) or Unix socket.

APM

Application Performance Monitoring — distributed tracing, service maps, latency analysis, and Continuous Profiler for application code.

Span

A named, timed unit of work in a distributed trace. Represents one operation — an HTTP call, DB query, or function invocation.

Trace

A collection of spans representing the complete end-to-end journey of a single request through a distributed system.

Tag

A key:value pair attached to metrics, logs, traces, and events to enable filtering, scoping, and grouping in dashboards and monitors.

Unified Service Tagging

Applying env, service, and version tags consistently across all telemetry types, enabling seamless correlation between metrics, logs, and traces.

Monitor

A rule that evaluates metric, log, or trace data against conditions and triggers notifications when thresholds are crossed.

SLO

Service Level Objective — a measurable reliability target (e.g., 99.9% uptime) tracked over a rolling or calendar time window.

Watchdog

Datadog's ML-powered anomaly detection engine. Automatically surfaces unusual patterns in metrics, logs, and traces without manual threshold configuration.

Autodiscovery

Mechanism for automatically detecting and configuring integration checks based on container labels or Kubernetes annotations in dynamic environments.

Cluster Agent

A special Kubernetes deployment that efficiently coordinates monitoring data collection across an entire cluster, distributing checks to node Agents.

Forwarder

Agent component that buffers telemetry in memory and sends it to the Datadog backend over HTTPS. Retries through network interruptions and discards the oldest payloads only when its memory limit is reached.

RUM

Real User Monitoring — captures actual end-user interactions, page loads, JS errors, and performance metrics from real browsers and mobile apps.

Session Replay

Pixel-perfect playback of a real user session, showing exactly what the user experienced in their browser for UX debugging.

Synthetic Test

A scripted, scheduled test that proactively checks API endpoints or user journeys from Datadog-managed global locations.

Flex Logs

Cost-effective log storage tier supporting in-place search without rehydration. Flex Frozen sub-tier provides up to 7-year compliance retention.

Grok Parser

A pattern-based log parsing tool in Datadog pipelines that extracts structured attributes from raw unstructured log text using named capture groups.

Fleet Automation

Datadog feature for remotely managing, configuring, and upgrading all Agents across all environments directly from the Datadog UI.

Apdex

Application Performance Index — a 0–1 score measuring user satisfaction based on response time thresholds. Available for HTTP/web APM services.

DORA Metrics

Deployment Frequency, Lead Time, Change Failure Rate, and Time to Restore — tracked in Datadog IDP to measure software delivery performance.

Issue Correlation

AI-powered feature that automatically maps related issues across services, tracing problems to their true origin to reduce alert noise.
