Datadog
Observability Platform
A comprehensive guide to Datadog's agent, metrics, logs, APM, infrastructure monitoring, alerting, dashboards, synthetics, security, and integrations — built from official documentation.
What is Datadog?
Datadog is a cloud-scale observability and security platform that unifies metrics, logs, traces, real user monitoring, synthetic testing, and security signals into a single pane of glass across any infrastructure.
Infrastructure Monitoring
Collect 75–100 system metrics every 15–20 seconds from hosts, containers, cloud services, and serverless functions via the lightweight Agent.
APM & Distributed Tracing
End-to-end distributed tracing with flame graphs, service maps, error tracking, deployment comparison, and Continuous Profiler.
Log Management
Centralized log ingestion, parsing pipelines, real-time Live Tail search, and Flex Logs tiered storage with up to 7-year retention.
Monitors & Alerting
Threshold, anomaly, forecast, composite, and ML-based monitors with multi-channel alerting via PagerDuty, Slack, OpsGenie, and more.
Security
Cloud SIEM, Cloud Security Management, App & API Protection, Code Security, and Workload Protection on one unified platform.
AI & ML Features
Watchdog AI for automated anomaly detection, root cause analysis, LLM Observability, and Issue Correlation across services.
Three pillars of observability: Datadog unifies Metrics (what is happening), Logs (why it happened), and Traces (where in the call chain). With APM and log injection enabled, the tracing library adds trace IDs to application logs, so a click on any log takes you directly to the correlated distributed trace.
How Datadog Works
A lightweight Agent deployed on each host collects and buffers telemetry, forwarding it to the Datadog SaaS backend over HTTPS (metrics) and SSL-encrypted TCP (logs) for processing, storage, and analysis.
Agent Internal Components
Collector — runs all configured checks and gathers metrics on a 15–20 second interval. Passes output to the local Aggregator and Forwarder.
Forwarder — sends payloads to Datadog over HTTPS. Buffers payloads in memory during network outages and retries them once connectivity returns. The oldest data is discarded only when the memory limit is reached.
APM Agent — separate optional process collecting distributed traces. Enabled by default. Listens on port 8126.
Process Agent — collects live process and container info. Requires explicit enablement for full process monitoring.
DogStatsD — Go implementation of StatsD. Accepts custom metrics over UDP (port 8125) or Unix socket, aggregates and forwards to backend.
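The Forwarder's buffering behavior (hold payloads while the network is down, drop the oldest only when the buffer is full) can be sketched in a few lines of Python. This is an illustration, not the Agent's implementation: the class name `RetryBuffer`, the payload-count limit, and the `send` callback are all assumptions made for the example.

```python
from collections import deque


class RetryBuffer:
    """Bounded in-memory queue that drops the oldest payload when full.

    Hypothetical sketch: the limit here is a payload count, which is an
    assumption made to keep the example short.
    """

    def __init__(self, max_payloads: int) -> None:
        # deque(maxlen=...) evicts from the left (oldest) on overflow.
        self._queue = deque(maxlen=max_payloads)
        self.dropped = 0

    def enqueue(self, payload: bytes) -> None:
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1  # the oldest payload is about to be evicted
        self._queue.append(payload)

    def flush(self, send) -> int:
        """Send queued payloads oldest-first; stop at the first failure."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break  # network still down; keep the rest buffered
            self._queue.popleft()
            sent += 1
        return sent


buf = RetryBuffer(max_payloads=3)
for i in range(5):
    buf.enqueue(f"payload-{i}".encode())

delivered = []
buf.flush(lambda p: delivered.append(p) or True)
print(delivered, buf.dropped)
```

With a capacity of three and five enqueued payloads, the two oldest are evicted and the remaining three are delivered in order once `send` succeeds.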
Data Retention
| Data Type | Default Retention |
|---|---|
| Metrics | 15 months |
| Custom span-based metrics | 15 months |
| Indexed spans / traces | 15 days |
| Ingested spans (in-flight) | 15 minutes |
| Standard Tier logs | 3–30 days (configurable) |
| Flex Logs (frozen tier) | Up to 7 years |
| RUM sessions | 30 days |
Agent Installation & Configuration
The Datadog Agent is open-source software written in Go that runs on every monitored host. Agent 7 is the latest major version. Resource footprint: ~0.08% CPU avg, ~95 MB RAM, 880 MB–1.3 GB disk on a c5.xlarge.
Installation
BASH — Linux one-liner
# Install Agent 7 on Linux
DD_API_KEY="<YOUR_API_KEY>" DD_SITE="datadoghq.com" \
bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"
# Service management
sudo systemctl start datadog-agent
sudo systemctl stop datadog-agent
sudo datadog-agent status
DOCKER
docker run -d --name dd-agent \
  -e DD_API_KEY="<API_KEY>" \
  -e DD_SITE="datadoghq.com" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:7
KUBERNETES — Datadog Operator (recommended)
helm repo add datadog https://helm.datadoghq.com
helm install datadog-operator datadog/datadog-operator
# DatadogAgent custom resource
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    clusterName: my-cluster
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
  features:
    apm:
      enabled: true
    logCollection:
      enabled: true
      containerCollectAll: true
    liveProcessCollection:
      enabled: true
Core Configuration — datadog.yaml
YAML — /etc/datadog-agent/datadog.yaml
api_key: <YOUR_API_KEY>
site: datadoghq.com # or datadoghq.eu, us3.datadoghq.com
hostname: my-host.example.com
## Tags applied to ALL telemetry from this host
tags:
  - env:production
  - team:platform
  - region:us-east-1
## Enable features
logs_enabled: true
apm_config:
  enabled: true
process_config:
  process_collection:
    enabled: true
## DogStatsD
dogstatsd_port: 8125
dogstatsd_socket: /var/run/datadog/dsd.socket
## Optional proxy
proxy:
  https: http://proxy.corp:3128
Deployment Options
| Environment | Method |
|---|---|
| Linux | install_script_agent7.sh |
| Windows | MSI installer / Chocolatey |
| macOS | Homebrew / DMG |
| Docker | gcr.io/datadoghq/agent:7 |
| Kubernetes | Datadog Operator (recommended) |
| Kubernetes (alt) | datadog/datadog Helm chart |
| AWS ECS | Daemon service / Fargate sidecar |
| AWS Lambda | Lambda Extension / Forwarder |
| IoT | IoT Agent (lightweight binary) |
Integration Check Config
YAML — conf.d/nginx.d/conf.yaml
init_config:
instances:
  - nginx_status_url: http://localhost/nginx_status
    tags:
      - service:nginx
      - env:production
K8s — Autodiscovery annotations
annotations:
  ad.datadoghq.com/nginx.check_names: '["nginx"]'
  ad.datadoghq.com/nginx.init_configs: '[{}]'
  ad.datadoghq.com/nginx.instances:
    '[{"nginx_status_url":"http://%%host%%/nginx_status"}]'
Fleet Automation: Remotely configure, upgrade, and manage all Agents across all environments from the Datadog UI — no SSH required. Supports automatic rollback if an Agent fails to restart after upgrade.
Metrics Collection & Custom Metrics
Metrics are time-series data points identified by a name and tags. They can be collected by the Agent, submitted via API/DogStatsD, or imported from cloud provider integrations.
Metric Types
| Type | Description | Use Case |
|---|---|---|
| COUNT | Events in a flush interval | http.requests |
| RATE | Count divided by time interval | requests.per_second |
| GAUGE | Instantaneous value at flush | cpu.usage, mem.free |
| HISTOGRAM | Statistical distribution | response.time |
| DISTRIBUTION | Global percentile calculations | latency.p99 |
| SET | Count of unique elements | unique_users |
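HISTOGRAM outputs: When using HISTOGRAM, the Agent by default emits .avg, .count, .median, .95percentile, and .max as separate metric streams. Additional aggregates such as .min and .sum can be enabled with the histogram_aggregates setting in datadog.yaml.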
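To make the HISTOGRAM aggregation concrete, the sketch below reduces one flush interval of samples to the avg, count, median, 95percentile, and max streams. It is an approximation under stated assumptions: the function name `histogram_aggregates` is hypothetical, and the nearest-rank percentile used here may not match the Agent's exact method.

```python
import math
import statistics


def histogram_aggregates(name: str, samples: list[float]) -> dict[str, float]:
    """Summarize one flush interval of samples into per-suffix values."""
    if not samples:
        return {}
    ordered = sorted(samples)
    # Nearest-rank 95th percentile (assumption; the Agent may interpolate).
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        f"{name}.avg": statistics.fmean(ordered),
        f"{name}.count": float(len(ordered)),
        f"{name}.median": statistics.median(ordered),
        f"{name}.95percentile": ordered[rank - 1],
        f"{name}.max": ordered[-1],
    }


aggs = histogram_aggregates("db.query.time", [0.010, 0.020, 0.030, 0.040, 0.500])
for stream, value in aggs.items():
    print(stream, value)
```

A single slow query (0.5 s) dominates both the max and the 95th percentile while leaving the median at 0.03 s, which is why percentile streams are usually more informative than the average for latency.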
Submitting Custom Metrics via DogStatsD
PYTHON
from datadog import initialize, statsd
initialize(statsd_host='localhost', statsd_port=8125)
# Gauge
statsd.gauge('app.queue.depth', 42,
             tags=['env:prod', 'service:worker'])
# Increment counter
statsd.increment('app.page.views',
                 tags=['page:home'])
# Histogram (timing)
statsd.histogram('db.query.time', 0.042,
                 tags=['query:get_user'])
# Distribution (global percentiles)
statsd.distribution('api.response.time', 125.3,
                    tags=['endpoint:/checkout'])
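Under the hood, a DogStatsD client sends small text datagrams of the form `<name>:<value>|<type>|#<tag1>,<tag2>` to the Agent. The following sketch builds such a datagram using only the standard library, which can help when debugging or when no client library is available. The helper names `format_datagram` and `send_datagram` are assumptions for this example, and sample rates and other optional fields are left out.

```python
import socket
from typing import Optional, Sequence

# DogStatsD type codes: c=count, g=gauge, h=histogram, d=distribution, s=set
TYPE_CODES = {"count": "c", "gauge": "g", "histogram": "h",
              "distribution": "d", "set": "s"}


def format_datagram(name: str, value: float, metric_type: str,
                    tags: Optional[Sequence[str]] = None) -> bytes:
    """Build one datagram: <name>:<value>|<type>[|#tag1,tag2]."""
    line = f"{name}:{value}|{TYPE_CODES[metric_type]}"
    if tags:
        line += "|#" + ",".join(tags)
    return line.encode("utf-8")


def send_datagram(payload: bytes, host: str = "localhost",
                  port: int = 8125) -> None:
    """Fire-and-forget UDP send; UDP gives no delivery confirmation."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


payload = format_datagram("app.queue.depth", 42, "gauge",
                          ["env:prod", "service:worker"])
print(payload)
```

Calling `send_datagram(payload)` on a host running the Agent would submit the same gauge as the `statsd.gauge(...)` call above.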
APM Metric Namespaces
| Namespace | Captures |
|---|---|
| trace.<span>.hits | Request count per service |
| trace.<span>.errors | Error count per service |
| trace.<span>.apdex | Apdex score (HTTP/web) |
| runtime.* | Language runtime metrics |
Log Collection, Pipelines & Storage
Datadog Log Management centralizes logs from all sources with real-time search, Grok parsing pipelines, faceted exploration, alerting, and Flex Logs for long-term cost-effective retention.
Enabling Log Collection
YAML — datadog.yaml
logs_enabled: true
YAML — conf.d/python.d/conf.yaml
logs:
  - type: file
    path: /var/log/myapp/*.log
    source: python
    service: my-service
    tags:
      - env:production
DOCKER — label-based collection
labels:
  com.datadoghq.ad.logs:
    '[{"source":"nginx","service":"web"}]'
Log Limits
| Limit | Value |
|---|---|
| Max log size (HTTPS) | 1 MB |
| Recommended per log | < 25 KB |
| Agent auto-split threshold | 900 KB |
| Max tags per log event | 100 |
| Max JSON attributes | 256 |
| Max attribute key length | 50 chars |
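A pre-flight check against these limits can catch oversized events before they are shipped. The sketch below is a rough illustration only: `check_log_event` is a hypothetical helper, it counts only top-level attributes, and it treats "1 MB" as 1,000,000 bytes, which is an assumption.

```python
import json

MAX_LOG_BYTES = 1_000_000  # "1 MB" from the table; exact byte count is an assumption
MAX_TAGS = 100
MAX_ATTRIBUTES = 256
MAX_KEY_LENGTH = 50


def check_log_event(attributes: dict, tags: list[str]) -> list[str]:
    """Return a list of limit violations for a flat JSON log event."""
    problems = []
    size = len(json.dumps(attributes).encode("utf-8"))
    if size > MAX_LOG_BYTES:
        problems.append(f"size {size} bytes exceeds {MAX_LOG_BYTES}")
    if len(tags) > MAX_TAGS:
        problems.append(f"{len(tags)} tags exceeds {MAX_TAGS}")
    if len(attributes) > MAX_ATTRIBUTES:
        problems.append(f"{len(attributes)} attributes exceeds {MAX_ATTRIBUTES}")
    long_keys = [k for k in attributes if len(k) > MAX_KEY_LENGTH]
    if long_keys:
        problems.append(
            f"{len(long_keys)} attribute key(s) longer than {MAX_KEY_LENGTH} chars")
    return problems


ok = check_log_event({"message": "user logged in", "status": "info"},
                     ["env:production"])
bad = check_log_event({"k" * 51: "v"}, [f"t:{i}" for i in range(101)])
print(ok)
print(bad)
```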
Storage Tiers
🔥 Standard Tier
Fully indexed logs for real-time search, monitors, dashboards, Live Tail. Configurable 3–30 day retention. Full Log Explorer capabilities.
❄️ Flex Logs
Cost-effective tier for lower-query-frequency logs. In-place searchability without rehydration. Flex Frozen sub-tier stores up to 7 years for compliance and forensic investigation.
🗃 Archive Search
Query logs archived directly in cloud storage (S3, GCS, Azure Blob) or Flex Frozen — without exporting or rehydrating. Ideal for audits and long-range analytics.
Trace-log correlation: APM auto-injects trace IDs into logs. Clicking a log entry with a trace ID jumps immediately to the correlated trace — no manual query building.
Grok Parser Example
GROK — Apache common access log rule
access_log %{ip:network.client.ip} %{notSpace:http.ident} %{notSpace:http.auth} \
\[%{date("dd/MMM/yyyy:HH:mm:ss Z"):date}\] \
"%{word:http.method} %{notSpace:http.url} %{notSpace:http.version}" \
%{integer:http.status_code} %{integer:network.bytes_written}
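For readers more familiar with regular expressions than Grok, the rule above corresponds roughly to the Python pattern below. This is an equivalent sketch for local testing, not what the Datadog pipeline runs: the group names and the `parse_access_log` helper are assumptions, and the sample line is made up for illustration.

```python
import re
from datetime import datetime

ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<url>\S+) (?P<version>\S+)" '
    r'(?P<status_code>\d+) (?P<bytes_written>\d+)'
)


def parse_access_log(line: str) -> dict:
    """Parse one access-log line into typed fields; raise ValueError on mismatch."""
    match = ACCESS_LOG.match(line)
    if match is None:
        raise ValueError(f"line does not match access-log pattern: {line!r}")
    fields = match.groupdict()
    # Same layout as the Grok date("dd/MMM/yyyy:HH:mm:ss Z") matcher.
    fields["date"] = datetime.strptime(fields["date"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status_code"] = int(fields["status_code"])
    fields["bytes_written"] = int(fields["bytes_written"])
    return fields


sample = ('192.0.2.10 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
parsed = parse_access_log(sample)
print(parsed["method"], parsed["status_code"], parsed["date"].isoformat())
```

As in the Grok rule, the integer matchers convert the status code and byte count to numbers, so they can be used as measures rather than plain strings.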
Application Performance Monitoring
Datadog APM provides end-to-end distributed tracing, flame graphs, service maps, deployment tracking, and Continuous Profiler — with deep correlation to logs, metrics, and RUM.
Instrumentation Methods
Single Step Instrumentation
Installs Agent + instruments app in one step — no code changes. The simplest starting point.
Tracing Libraries
Language-specific libraries for Python, Java, Go, Ruby, Node.js, .NET, PHP, C++, Rust.
OpenTelemetry
Send OTel metrics, traces, and logs into Datadog via the Collector with Datadog Exporter.
Dynamic Instrumentation
Add instrumentation to live running services via the Datadog UI — no code deploys or restarts required.
Key APM Features
Service Map
Auto-generated dependency map of all services with real-time error rates and latency per connection.
Flame Graphs
Full call tree of any trace with time-spent visualization. Identify slowest code paths instantly.
Error Tracking
Intelligent error grouping across services. Track new vs. regressing issues by deployed version.
Deployment Tracking
Compare error rate, latency, and throughput before/during/after each deployment. Auto-detect faulty deploys via Watchdog.
Continuous Profiler
Always-on low-overhead code profiling in production. See exactly which methods consume CPU, memory, and I/O.
Trace-Log Correlation
Trace IDs injected into logs. View logs side-by-side with the trace that generated them.
Python APM
PYTHON
pip install ddtrace
# Auto-instrument at startup (recommended)
DD_SERVICE="my-api" DD_ENV="prod" DD_VERSION="1.2.0" \
  ddtrace-run python app.py
# Manual span creation (Python)
from ddtrace import tracer
with tracer.trace("db.query", resource="SELECT users") as span:
    span.set_tag("db.type", "postgres")
    result = db_query()  # db_query() stands in for your own database call
Java APM
JAVA — JVM flag
java -javaagent:/path/to/dd-java-agent.jar \
  -Ddd.service=my-app \
  -Ddd.env=production \
  -Ddd.version=1.0.0 \
  -jar app.jar
OpenTelemetry Collector
YAML — OTel Collector Datadog Exporter
exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
      site: datadoghq.com
    traces:
      compute_stats_by_span_kind: true
service:
  pipelines:
    # Receivers and processors are omitted here; each pipeline also needs at least one receiver
    traces:
      exporters: [datadog]
    metrics:
      exporters: [datadog]
Unified Service Tagging: Apply env, service, and version consistently across all telemetry types from a service to enable seamless pivoting between metrics, logs, and traces.
Infrastructure & Container Monitoring
Monitor hosts, containers, Kubernetes clusters, cloud services, and serverless functions from a unified Infrastructure List with real-time metrics, health status, and live process monitoring.
Infrastructure Views
- Infrastructure List — every host with key metrics and tag filtering
- Host Map — hexagonal heatmap of all hosts by any metric
- Containers page — resource metrics and faceted search across containers
- Container Images — every image in your env + vulnerability data
- Orchestrator Explorer — monitor pods, deployments, namespaces
- Control Plane Monitoring — API server, scheduler, controller manager, etcd
- Live Processes — real-time process list with CPU, memory, I/O
- Network Performance Monitoring — eBPF-based traffic flow visibility
Key System Metrics
| Category | Example Metrics |
|---|---|
| CPU | system.cpu.user, system.load.1 |
| Memory | system.mem.used, system.swap.used |
| Disk I/O | system.io.rkb_s, system.disk.used |
| Network | system.net.bytes_rcvd, bytes_sent |
Cloud Integrations
Kubernetes Cluster Agent
The Cluster Agent efficiently gathers monitoring data from across an orchestrated cluster. It distributes check configurations to node Agents and ensures only one instance of each check runs per workload — preventing duplicate data collection across replicas.
The Cluster Agent holds configs and dispatches them to node Agents every 10 seconds. If a node Agent stops reporting, the Cluster Agent removes it from the active pool and redistributes its configurations.
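The dispatch-and-rebalance behavior described above can be modeled with a small scheduler. The sketch below is a toy model under loud assumptions: the `CheckDispatcher` class, the least-loaded placement policy, and the node and check names are all hypothetical, and the real Cluster Agent's balancing logic is not described in this guide.

```python
class CheckDispatcher:
    """Assign each check config to exactly one node Agent, rebalancing on loss."""

    def __init__(self) -> None:
        self.assignments: dict[str, list[str]] = {}

    def add_node(self, node: str) -> None:
        self.assignments.setdefault(node, [])

    def _least_loaded(self) -> str:
        if not self.assignments:
            raise RuntimeError("no node Agents available")
        # Tie-break on node name so the result is deterministic.
        return min(self.assignments,
                   key=lambda n: (len(self.assignments[n]), n))

    def schedule(self, check: str) -> str:
        node = self._least_loaded()
        self.assignments[node].append(check)
        return node

    def remove_node(self, node: str) -> None:
        """Drop a node that stopped reporting and hand its checks to the others."""
        orphaned = self.assignments.pop(node, [])
        for check in orphaned:
            self.schedule(check)


dispatcher = CheckDispatcher()
for node in ("node-a", "node-b", "node-c"):
    dispatcher.add_node(node)
for check in ("postgres", "redis", "nginx", "kafka"):
    dispatcher.schedule(check)

dispatcher.remove_node("node-a")  # simulate a node Agent going silent
print(dispatcher.assignments)
```

Every check remains assigned to exactly one node after `node-a` disappears, which is the property that prevents both gaps and duplicate collection.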
Serverless
For AWS Lambda, Datadog collects metrics, traces, and logs via the Lambda Extension (preferred; runs alongside the function inside the Lambda execution environment) or the Lambda Forwarder (CloudWatch-based). Supports enhanced Lambda metrics, cold start detection, and X-Ray integration.
Monitors, Alerts & SLOs
Monitors evaluate metric, log, or trace queries against defined conditions and trigger alerts with notifications to PagerDuty, Slack, email, OpsGenie, and more. Evaluation frequency defaults to 1 minute.
Monitor Types
Metric Monitor
Alert when a metric threshold is crossed over a rolling window. Simple or multi-alert modes grouped by any tag.
Log Monitor
Alert on indexed log count, attribute unique count, or measure. Supports group-by facets. Max 2-day rolling window.
APM Monitor
Monitor service APM metrics (hits, errors, latency) or alert on Trace Analytics Indexed Span patterns.
Anomaly Monitor
ML-based detection learns seasonal patterns. Alerts on statistically unexpected deviations without manual thresholds.
Forecast Monitor
Predicts when a metric will breach a threshold. Ideal for disk capacity and resource planning.
Synthetic Monitor
Alert when a Synthetic API test or browser test fails or exceeds latency thresholds.
Service Check
Alert based on OK / WARNING / CRITICAL status submitted by Agent integration checks.
Composite Monitor
Combine monitors with boolean logic (AND, OR, NOT). Alert only when multiple conditions are simultaneously true.
Database Monitoring
Alert on slow queries, connection pool saturation, replication lag for PostgreSQL, MySQL, SQL Server, Oracle.
Monitor — Terraform
TERRAFORM
resource "datadog_monitor" "high_cpu" {
  name    = "High CPU Usage"
  type    = "metric alert"
  message = "CPU > 90% on {{host.name}} @pagerduty"
  query   = "avg(last_5m):avg:system.cpu.user{env:production} by {host} > 90"
  monitor_thresholds {
    critical = 90
    warning  = 75
  }
  notify_no_data    = true
  no_data_timeframe = 20
  tags              = ["env:production", "team:platform"]
}
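The query in this monitor averages each host's CPU over the last five minutes and compares the result with the critical and warning thresholds. The Python sketch below mimics that evaluation for a multi-alert monitor grouped by host. It is a simplified model: `evaluate_monitor` and the sample data are hypothetical, and real monitors apply the no-data timeframe and recovery thresholds that are ignored here.

```python
from statistics import fmean


def evaluate_monitor(points: list[float], critical: float, warning: float) -> str:
    """Classify the average of the evaluation window against thresholds."""
    if not points:
        return "NO DATA"
    value = fmean(points)
    if value > critical:
        return "ALERT"
    if value > warning:
        return "WARN"
    return "OK"


# Hypothetical per-host samples of system.cpu.user over the last 5 minutes.
window_by_host = {
    "web-1": [95.0, 97.5, 92.0, 96.0, 94.5],
    "web-2": [70.0, 80.0, 78.0, 82.0, 79.0],
    "web-3": [20.0, 25.0, 22.0, 18.0, 21.0],
    "web-4": [],
}
states = {
    host: evaluate_monitor(pts, critical=90, warning=75)
    for host, pts in window_by_host.items()
}
print(states)
```

Because the query uses a strict `>`, a window averaging exactly 90 stays in the warning state rather than alerting.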
Notification Channels
Monitor Configuration Reference
| Option | Description |
|---|---|
| evaluation_window | Time range for query (last_5m, last_1h) |
| evaluation_frequency | How often query runs (default 1 min) |
| critical | Value triggering ALERT state |
| warning | Value triggering WARNING state |
| notify_no_data | Alert if no data is received |
| renotify_interval | Re-alert on sustained state (minutes) |
| require_full_window | Only evaluate with complete data window |
| multi_alert | Separate alert per dimension (e.g., per host) |
Service Level Objectives (SLOs)
Three SLO types:
| Type | Based On |
|---|---|
| Metric-based | Good events / total events ratio |
| Monitor-based | Uptime % derived from monitor state |
| Time Slice | % of time windows metric was within threshold |
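The three SLO types differ only in what they count. As a rough sketch (function names and sample numbers are assumptions, and the time-slice version assumes "within threshold" means at or below it), each one reduces to a simple percentage:

```python
def metric_based_slo(good_events: int, total_events: int) -> float:
    """Percentage of good events out of all events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return 100.0 * good_events / total_events


def monitor_based_slo(alert_minutes: float, window_minutes: float) -> float:
    """Uptime percentage: share of the window the monitor was not alerting."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return 100.0 * (window_minutes - alert_minutes) / window_minutes


def time_slice_slo(slice_values: list[float], threshold: float) -> float:
    """Percentage of time slices whose value stayed at or below the threshold."""
    if not slice_values:
        raise ValueError("need at least one time slice")
    good = sum(1 for v in slice_values if v <= threshold)
    return 100.0 * good / len(slice_values)


metric_sli = metric_based_slo(good_events=99_950, total_events=100_000)
uptime_sli = monitor_based_slo(alert_minutes=43.2, window_minutes=30 * 24 * 60)
latency_sli = time_slice_slo([120, 180, 95, 310, 150], threshold=200)
print(round(metric_sli, 3), round(uptime_sli, 3), latency_sli)
```

In the uptime example, 43.2 minutes of alerting in a 30-day window corresponds to roughly 99.9% availability.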
Dashboards
Dashboards provide real-time insight into system health and business KPIs. Build from any combination of metrics, logs, traces, RUM, and events with template variables for dynamic scoping.
Dashboard Types
| Type | Use Case |
|---|---|
| Timeboard | All widgets share the same time range. Best for metric correlation during investigations. |
| Screenboard | Free-form layout with independent time ranges per widget. Best for NOC status displays. |
| Notebook | Markdown + live graphs. Best for postmortems, runbooks, and incident investigations. |
Widget Types
Template Variables — Terraform
TERRAFORM
resource "datadog_dashboard" "service_health" {
  title       = "Service Health"
  layout_type = "ordered"
  template_variable {
    name    = "env"
    prefix  = "env"
    default = "production"
  }
  widget {
    timeseries_definition {
      request {
        q = "avg:trace.web.request.duration{$env} by {service}"
      }
      title = "Request Latency by Service"
    }
  }
}
Datadog Sheets: Spreadsheet-style interface for analyzing telemetry — perform lookups, build pivot tables, create calculated columns, join datasets. Results can be added to dashboards or shared with colleagues.
Synthetics & Real User Monitoring
Synthetics proactively tests endpoints and journeys from Datadog-managed global locations. RUM captures real user interactions and performance from actual browsers and mobile apps.
Synthetic Test Types
| Type | Description |
|---|---|
| API Test | HTTP, gRPC, WebSocket, TCP, SSL, DNS checks. Assert on status codes, body, headers, latency. |
| Multistep API | Chain multiple API calls with variables passed between steps. Test full auth + action flows. |
| Browser Test | Headless Chrome tests that record and replay user journeys. Detect visual regressions and broken UI. |
| Mobile Test | Native iOS and Android app testing with real device simulation. |
Continuous Testing (CI/CD)
BASH — datadog-ci
npm install -g @datadog/datadog-ci
# Run Synthetic tests in CI pipeline
datadog-ci synthetics run-tests \
  --public-id "abc-123-xyz" \
  --apiKey "$DD_API_KEY" \
  --appKey "$DD_APP_KEY" \
  --failOnCriticalErrors
Real User Monitoring (RUM)
- Session Replay — pixel-perfect video-like replay of real user sessions
- Core Web Vitals — LCP, INP (which replaced FID as a Core Web Vital in March 2024), and CLS tracking with custom user timings
- Error Tracking — frontend JS errors grouped and prioritized by user impact
- RUM-APM Correlation — link frontend sessions to backend distributed traces
- Funnel Analysis — track conversion rates through multi-step user flows
- RUM Recommendations — AI-powered performance improvement suggestions (Preview)
- Mobile RUM — iOS and Android performance monitoring and crash tracking
Feature Flags & A/B Testing
Datadog Feature Flags integrates with your existing feature flag provider to track flag evaluations alongside RUM data. Correlate feature flag rollouts directly with performance regressions and error spikes in the same view.
Security Products
Datadog unifies observability and security on one platform — eliminating the context-switching between tools that slows down incident response when a performance issue has a security dimension.
Cloud SIEM
Detect, investigate, and respond to security threats across cloud and on-premises systems. Correlates logs, metrics, and network data to surface high-fidelity signals.
Cloud Security Management
Continuously audits cloud configurations, assesses identity risks (CIEM), and detects runtime threats across AWS, GCP, and Azure.
App & API Protection
Detects and blocks threats targeting production applications and APIs in real time, with APM trace context for each attack signal.
Code Security
Detects and fixes vulnerabilities in first-party code, open-source dependencies (SCA), and infrastructure-as-code from dev through runtime (IAST).
Workload Protection
Uses eBPF to monitor file, network, and process activity at the kernel level. Detects privilege escalation, cryptomining, and unusual process behavior.
Audit Trail
Immutable audit log of all user and configuration changes across the Datadog platform — who changed what monitor, API key, or Agent config and when.
Integrations Ecosystem
Datadog ships 1,000+ vendor-backed integrations — each providing ready-made Agent checks, dashboards, and monitors. Connect to cloud providers, databases, message queues, CI/CD tools, and observability standards.
Cloud Platforms
Databases & Caches
Message Queues & Streaming
Web Servers & Proxies
DevOps & CI/CD
Alerting & Incident Management
Observability Standards
Datadog Internal Developer Portal (IDP): Software Catalog + Self-Service Actions + Scorecards. Visualize service hierarchies, enable self-service infrastructure provisioning, and evaluate production-readiness before release.
Tags & Unified Service Tagging
Tags are key:value metadata attached to every metric, log, trace, and event. A consistent tagging strategy is the foundation of effective filtering, alerting, and root-cause analysis in Datadog.
Unified Service Tagging (Required)
| Tag | Purpose | Example |
|---|---|---|
| env | Deployment environment | production |
| service | Service / application name | checkout-api |
| version | Deployed code version | 1.4.2 |
ENV VARS — Kubernetes pod spec
env:
  - name: DD_ENV
    value: production
  - name: DD_SERVICE
    value: checkout-api
  - name: DD_VERSION
    value: "1.4.2"
  - name: DD_AGENT_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
Tagging Best Practices
- Use lowercase key:value format consistently everywhere
- Always tag with env: (production, staging, dev)
- Always tag with service: for service-level views
- Always tag with version: for deployment tracking
- Use team: for ownership routing in monitor notifications
- Use region: and availability-zone: for geographic scoping
- Avoid high-cardinality values on metrics (user_id, request_id); use traces for those
- Limit custom metric tag cardinality to control monthly custom metric costs
- Apply tags at the Agent level for host-wide application
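The cardinality advice is easier to appreciate with numbers. Each distinct combination of metric name and tag values is stored as its own time series, so one unbounded tag multiplies the series count. The sketch below counts combinations for two hypothetical metrics. The helper name and sample data are made up for illustration, and actual billing rules are defined by Datadog rather than by this code.

```python
from collections import defaultdict
from typing import Iterable


def timeseries_per_metric(points: Iterable[tuple[str, list[str]]]) -> dict[str, int]:
    """Count distinct tag combinations seen for each metric name."""
    combos: dict[str, set[frozenset[str]]] = defaultdict(set)
    for name, tags in points:
        combos[name].add(frozenset(tags))  # tag order does not matter
    return {name: len(seen) for name, seen in combos.items()}


# 1,000 submissions each: one metric tagged by endpoint, one by user ID.
low = [("app.requests", ["env:prod", f"endpoint:/e{i % 5}"]) for i in range(1000)]
high = [("app.requests.by_user", ["env:prod", f"user_id:{i}"]) for i in range(1000)]

counts = timeseries_per_metric(low + high)
print(counts)
```

The endpoint-tagged metric produces 5 series while the user-tagged one produces 1,000 from the same number of submissions, which is why per-user detail belongs in traces or logs.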
Tag Sources (All Applied Automatically)
| Source | Example Tags |
|---|---|
| Agent config (datadog.yaml) | env:prod, team:platform |
| Cloud provider metadata | aws:us-east-1, instance_type:c5 |
| Container labels / K8s labels | app:frontend, kube_namespace:default |
| Integration check config | db:postgres-prod |
| DogStatsD metric submission | endpoint:/checkout |
Datadog API & IaC
Datadog exposes a comprehensive REST API for programmatic access to all platform resources. Official SDKs, Terraform provider, and datadog-ci CLI enable full Datadog-as-Code workflows.
REST API — Key Endpoints
| Endpoint | Action |
|---|---|
| POST /api/v1/series | Submit custom metrics |
| POST /api/v2/logs | Send log events |
| GET /api/v1/monitors | List all monitors |
| POST /api/v1/monitor | Create a monitor |
| GET /api/v1/dashboard | List all dashboards |
| POST /api/v1/events | Post to Events Stream |
| GET /api/v1/query | Metrics query over time range |
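As an illustration of the first endpoint in the table, the sketch below builds a `POST /api/v1/series` request with the standard library and stops short of sending it. The helper name `build_series_request` is an assumption, the metric and tag values are placeholders, and only the fields needed for a single gauge point are shown.

```python
import json
import time
import urllib.request


def build_series_request(api_key: str, metric: str, value: float,
                         tags: list[str],
                         site: str = "datadoghq.com") -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/series request for one gauge point."""
    body = {
        "series": [
            {
                "metric": metric,
                "type": "gauge",
                "points": [[int(time.time()), value]],  # [unix_timestamp, value]
                "tags": tags,
            }
        ]
    }
    return urllib.request.Request(
        url=f"https://api.{site}/api/v1/series",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )


request = build_series_request("<DD_API_KEY>", "app.queue.depth", 42, ["env:prod"])
print(request.get_method(), request.full_url)
# To actually submit the point: urllib.request.urlopen(request)
```

Metric submission needs only the API key. Read endpoints such as listing monitors also require an application key.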
PYTHON — API client
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
config = Configuration()
config.api_key["apiKeyAuth"] = "<DD_API_KEY>"
config.api_key["appKeyAuth"] = "<DD_APP_KEY>"
with ApiClient(config) as api_client:
    api = MonitorsApi(api_client)
    print(api.list_monitors())
Terraform Provider
TERRAFORM — Provider setup
terraform {
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0"
    }
  }
}
# Auth via DD_API_KEY + DD_APP_KEY env vars
provider "datadog" {
  api_url = "https://api.datadoghq.com/"
}
Terraform Resources
datadog-ci CLI
BASH
npm install -g @datadog/datadog-ci
# Upload source maps for RUM error tracking
datadog-ci sourcemaps upload ./dist \
  --service my-app --release-version 1.4.2 \
  --minified-path-prefix "<MINIFIED_PATH_PREFIX>"
# Report CI test results
datadog-ci junit upload --service my-app ./test-results.xml
Key Terminology
Core terms used across the Datadog platform and documentation.