🧠Mental Model: What You're Managing

A datacenter cage is a locked, dedicated enclosure within a larger multi-tenant colocation facility. Colocation (colo) providers sell space, power, and connectivity — you own the hardware. The cage is your controlled micro-environment inside their building.

Think of it as a system with five distinct but interdependent layers. Failures in lower layers cascade upward; over-provisioning in one layer cannot compensate for a bottleneck in another.

L5 | 🤖 Control Layer | Monitoring, automation, access control, runbooks
L4 | 🖥️ System Layer | Servers, storage, networking hardware
L3 | 🗄️ Rack Layer | Physical organization, U-space, cable management
L2 | 🔒 Cage Layer | Physical security boundary, access logging
L1 | Facility Layer | Utility power, UPS, generators, CRAC/CRAH cooling, fire suppression

Core Mandate

Ensure availability, safety, performance, and traceability across all five layers. Your job is to prevent bottlenecks, outages, and silent degradation.

SRE Mapping

If you think in pipelines, the datacenter maps cleanly:

⚡ Power
🌬️ Cooling
🌐 Network
🖥️ Compute
📊 Observability

Each "pipeline" has an input, a capacity, a failure mode, and a monitoring signal. Treat them accordingly.

🧱Cage Design & Layout

A cage is typically a welded or bolted mesh enclosure (floor-to-ceiling or raised) within a shared colo floor. Solid-wall cages offer more security; mesh cages allow easier visual inspection and airflow monitoring. Most enterprise deployments use mesh with solid-panel overlays for higher-security zones.

Core Design Concepts

  • Cold aisle / hot aisle containment — servers face the cold aisle (cool intake air), exhaust into the hot aisle. Containment systems prevent mixing of cold and hot air streams, dramatically improving cooling efficiency (PUE impact: ~0.1–0.2).
  • Density planning (kW per rack) — standard enterprise racks run 5–10 kW; GPU/AI racks can reach 20–80 kW. Know your density before placing hardware. Overcrowding causes thermal runaway.
  • Growth planning — maintain 30–50% free rack space and 20–30% free power headroom at all times. Running at 90%+ capacity removes your buffer for planned maintenance or emergency rerouting.

Physical Components

Cage Access | Biometric + PIN | Badge alone insufficient for Tier III/IV
Cameras | ≥4 angles | Cover all rack fronts + cage entrance
Floor Type | Raised or Slab | Raised allows under-floor cable routing
Free Capacity | 30–50% | Headroom for growth & maintenance

Layout Principles

  • Centralize network racks to minimize cable runs and latency
  • Separate compute, storage, and network into distinct rack zones
  • Isolate high-density GPU/AI racks — they require dedicated cooling circuits
  • Keep critical infrastructure (PDUs, OOB switches) accessible without disturbing production racks

Layout Rule

Minimize cable length, maximize airflow, and isolate failure domains. Every layout decision is a trade-off between these three.

Uptime Institute Tier Classification

Colo facilities are rated Tier I–IV. Your cage's reliability ceiling is determined by the facility tier.

Tier | Redundancy | Uptime SLA | Annual Downtime
Tier I | No redundancy | 99.671% | ≤28.8 hrs
Tier II | Redundant components | 99.741% | ≤22.7 hrs
Tier III | N+1, concurrently maintainable | 99.982% | ≤1.6 hrs
Tier IV | 2N, fault tolerant | 99.995% | ≤26.3 min
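
The downtime column follows directly from the uptime percentage. A minimal sketch of that arithmetic, assuming a 365-day year (the function name is illustrative, not from any standard):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(uptime_percent: float) -> float:
    """Maximum downtime per year implied by an uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    tiers = [("Tier I", 99.671), ("Tier II", 99.741),
             ("Tier III", 99.982), ("Tier IV", 99.995)]
    for name, sla in tiers:
        hours = annual_downtime_hours(sla)
        print(f"{name}: {hours:.2f} h/year ({hours * 60:.0f} min)")
```

The same function is useful when a colo contract quotes an SLA percentage and you need to translate it into an outage budget.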

🗄️Rack Engineering

Standard Rack Specifications

Height | 42U – 52U | 1U = 1.75 inches (44.45 mm)
Width | 19-inch | EIA-310 standard; outer width ~600mm
Depth | 1000–1200mm | Deep racks needed for modern servers (800–900mm depth)
Max Load | 900–1360 kg | Verify floor load rating before placing full racks
Typical Density | 5–15 kW | Standard compute; GPU racks up to 80 kW
Air Direction | Front → Back | Standardize across all equipment

Vertical Rack Layout (Top → Bottom)

This ordering optimizes airflow, cable management, and center of gravity:

Top | 🔌 Patch panels / Fiber trays | Structured cabling termination; keep accessible
↓ | 🌐 Top-of-Rack (ToR) switches | 1–4U; short runs to all servers below
↓ | 💨 Lightweight / 1U servers | Management, edge, or utility nodes
Bottom | 🖥️ Heavy compute / storage servers | Lower position = lower center of gravity = seismic safety
Side | PDUs (vertical, side-mounted) | One PDU-A (Feed A), one PDU-B (Feed B) per rack

Common Mistake

Never leave unused U-spaces open. Fit a blanking panel over every unused U: air leakage through empty slots can reduce cooling efficiency by 10–30%.

U-Space Planning

Track U-space in your DCIM or CMDB (e.g., NetBox). Each device consumes a specific U count. Common examples:

Device Type | Typical U | Example
1U server | 1U | Dell PowerEdge R650, Supermicro 1029U
2U server | 2U | Dell R750, HPE ProLiant DL380 Gen10
GPU server | 4U–6U | Supermicro SYS-420GP (4U), NVIDIA DGX A100 (6U)
ToR switch | 1U | Arista 7050X3, Cisco Nexus 93180YC
Patch panel (24-port) | 1U | Leviton, Panduit
KVM switch | 1U | Raritan Dominion
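
To make U-space tracking concrete, here is a minimal sketch of a placement check. It assumes the rack is numbered from U1 at the bottom and that occupancy is held as a plain mapping of starting U to device height; a real DCIM stores richer data, and `find_free_slot` is a hypothetical helper, not a NetBox function.

```python
from typing import Dict, Optional


def find_free_slot(rack_height_u: int, occupied: Dict[int, int],
                   device_u: int) -> Optional[int]:
    """Return the lowest starting U with `device_u` contiguous free units.

    `occupied` maps each mounted device's starting U to its height in U.
    Returns None when no contiguous gap is large enough.
    """
    used = set()
    for start, height in occupied.items():
        used.update(range(start, start + height))

    last_start = rack_height_u - device_u + 1
    for start in range(1, last_start + 1):
        if all(u not in used for u in range(start, start + device_u)):
            return start
    return None


if __name__ == "__main__":
    # Hypothetical 42U rack: 2U at U1, 1U at U3, 4U at U10.
    rack = {1: 2, 3: 1, 10: 4}
    print(find_free_slot(42, rack, 2))  # lowest gap that fits a 2U server
```

Searching from the bottom up matches the guidance to keep heavy equipment low; a top-down search would suit patch panels and switches.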

⚡Power Systems

Power is the most critical physical resource in a cage. A power failure cascades instantly to every system; no redundancy elsewhere compensates for lost power. Understand every step in the power chain.

Power Chain

Utility Grid → Transformer → ATS/STS → UPS → PDU → Server PSU

ATS = Automatic Transfer Switch; STS = Static Transfer Switch; UPS = Uninterruptible Power Supply; PDU = Power Distribution Unit

Redundancy Model

Enterprise standard is A/B power feeds — two completely independent paths from separate utility feeds, separate UPS systems, separate PDUs, to dual PSUs in each server.

Redundancy Level | 2N (Tier IV) | Full duplication; most resilient
Standard Enterprise | N+1 | One extra UPS/PDU per circuit
PSU Config | Dual PSU | One on Feed A, one on Feed B
UPS Runtime | 10–30 min | Bridge until generator starts (<30 sec)

Hard Rule

Never exceed 80% of any circuit's rated capacity (NEC 80% rule for continuous loads). A 20A circuit = 16A usable. A 30A circuit = 24A usable. Sustained loads above 80% risk breaker trips and thermal damage.

Power Concepts

kW vs kVA

kVA is apparent power (what the PDU is rated for); kW is real power (what the server actually consumes). The ratio is the power factor (PF). Modern server PSUs typically have PF ≥ 0.95, so kW ≈ kVA × 0.95. Always plan budgets in kW (actual consumption), but size circuits in kVA (what the PDU/breaker must handle).
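
The 80% rule and the kW/kVA relationship combine into a single budget calculation. The sketch below assumes a power factor of 0.95 and, for three-phase circuits, that the voltage given is line-to-line; the 208V/30A figures in the example are illustrative North American values, not taken from the text.

```python
import math


def usable_kw(volts: float, breaker_amps: float, power_factor: float = 0.95,
              three_phase: bool = False, derate: float = 0.8) -> float:
    """Real power (kW) a circuit can carry continuously.

    kVA = V x A / 1000 (times sqrt(3) for three-phase, line-to-line volts),
    then derated to 80% for continuous load and scaled by power factor.
    """
    kva = volts * breaker_amps / 1000
    if three_phase:
        kva *= math.sqrt(3)
    return kva * derate * power_factor


if __name__ == "__main__":
    print(f"208V/30A single-phase: {usable_kw(208, 30):.2f} kW")
    print(f"208V/30A three-phase:  {usable_kw(208, 30, three_phase=True):.2f} kW")
```

Under these assumptions a single-phase 208V/30A circuit yields about 4.74 kW of plannable server load, which is why the breaker rating alone overstates what a rack can draw.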

Phase Balancing

Three-phase power is standard in datacenters. Distribute load evenly across phases (L1, L2, L3) to avoid overloading a single phase. Target imbalance < 10% between phases. Measure at the PDU breaker level.
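
One way to put a number on the imbalance target is shown below. It uses a common definition (largest deviation from the mean phase current, as a percentage of the mean); this definition is an assumption, since the text does not say which formula it intends.

```python
def phase_imbalance_percent(l1_amps: float, l2_amps: float,
                            l3_amps: float) -> float:
    """Largest deviation from the mean phase current, as a percent of the mean."""
    loads = (l1_amps, l2_amps, l3_amps)
    mean = sum(loads) / len(loads)
    if mean == 0:
        return 0.0  # an unloaded PDU is trivially balanced
    return max(abs(amps - mean) for amps in loads) / mean * 100


if __name__ == "__main__":
    # Hypothetical per-phase readings from a PDU, in amps.
    imbalance = phase_imbalance_percent(16, 15, 14)
    print(f"imbalance: {imbalance:.1f}% (target < 10%)")
```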

Power Measurement Tools

  • Smart PDUs (e.g., Raritan PX3, APC AP8900 series) — per-outlet monitoring, remote switching
  • DCIM platforms (e.g., Nlyte, Sunbird) — aggregate power dashboards
  • IPMI/BMC — per-server power consumption via Redfish or IPMI 2.0

PUE (Power Usage Effectiveness)

PUE = Total Facility Power ÷ IT Equipment Power. A PUE of 1.0 is theoretical perfection; modern facilities target 1.2–1.4. A PUE of 2.0 means as much energy is wasted on overhead (cooling, lighting) as powers IT gear — unacceptable by today's standards.
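
The PUE formula above, plus the overhead share it implies, can be sketched in a few lines. The function names and the kW figures in the example are illustrative.

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_kw


def overhead_fraction(pue_value: float) -> float:
    """Share of total facility power that does not reach IT equipment."""
    return 1 - 1 / pue_value


if __name__ == "__main__":
    value = pue(total_facility_kw=1300, it_kw=1000)
    print(f"PUE {value:.2f}, overhead {overhead_fraction(value):.0%} of total")
```

At a PUE of 2.0 the overhead fraction is exactly half of total facility power, which matches the description in the text.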

🌬️Cooling & Airflow

Thermal management is the second critical physical resource. Servers tolerate brief power interruption via UPS; they tolerate almost no thermal excursion — CPU throttling begins at ~70°C and emergency shutdown typically triggers at 85–95°C.

Cooling Models

Model | How It Works | Best For | Limitation
Cold Aisle Containment (CAC) | Enclose the cold aisle; servers draw cool air from inside the containment | Standard compute, 5–15 kW/rack | Hot exhaust enters open data hall
Hot Aisle Containment (HAC) | Enclose the hot aisle; hot air is captured and returned directly to CRACs | Higher efficiency than CAC, same density | Hot aisle is inaccessible during operation
In-Row Cooling | Cooling units placed between racks; cool air delivered at row level | High-density rows, GPU clusters | Higher CAPEX per kW cooled
Rear-Door Heat Exchanger | Chilled water coil in rack rear door absorbs server exhaust directly | Very high density (15–40 kW/rack) | Requires chilled water plumbing per rack
Direct Liquid Cooling (DLC) | Cold plates on CPUs/GPUs with liquid coolant; near-zero air cooling | AI/ML racks, 40–100+ kW/rack | Complex plumbing, leak risk, higher cost

ASHRAE Thermal Guidelines

ASHRAE TC 9.9 defines environmental envelopes for IT equipment:

Class A1 Inlet Temp | 15–32°C | Recommended: 18–27°C
Class A2 Inlet Temp | 10–35°C | Most modern servers qualify
Humidity (RH) | 20–80% | Non-condensing; low RH = ESD risk
Dew Point | 5.5–15°C | Per ASHRAE A2; critical for condensation prevention

Best Practices

  • Blanking panels — fill every unused rack U. Without them, cold air short-circuits from cold aisle through the rack to hot aisle without cooling anything
  • Cable cutout seals — use brush strips or grommets on raised floor cutouts to prevent hot-cold air mixing under-floor
  • Airflow monitoring — deploy temperature sensors at rack inlet (U1–U3) and outlet (top of rack). Alert on inlet >27°C or delta T >15°C
  • CRAC vs CRAH — CRAC (Computer Room Air Conditioner) uses DX refrigerant; CRAH (Computer Room Air Handler) uses chilled water. CRAH is more efficient at scale, requires chiller plant
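
The airflow alert thresholds above (inlet above 27°C, delta T above 15°C) translate into a simple check. This is a minimal sketch; how readings are collected and where alerts are sent is left out, and the function name is hypothetical.

```python
from typing import List

INLET_MAX_C = 27.0    # inlet alert threshold from the text
DELTA_T_MAX_C = 15.0  # outlet-minus-inlet alert threshold from the text


def thermal_alerts(inlet_c: float, outlet_c: float) -> List[str]:
    """Return alert messages for one rack's inlet/outlet reading pair."""
    alerts = []
    if inlet_c > INLET_MAX_C:
        alerts.append(f"inlet {inlet_c:.1f}C exceeds {INLET_MAX_C:.0f}C")
    delta_t = outlet_c - inlet_c
    if delta_t > DELTA_T_MAX_C:
        alerts.append(f"delta T {delta_t:.1f}C exceeds {DELTA_T_MAX_C:.0f}C")
    return alerts


if __name__ == "__main__":
    for inlet, outlet in [(24.0, 35.0), (25.0, 41.0), (28.5, 45.0)]:
        print(inlet, outlet, thermal_alerts(inlet, outlet) or "ok")
```

A high delta T with a normal inlet usually points at restricted airflow through the rack rather than a supply-air problem, so it is worth keeping the two alerts separate.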

Efficiency Tip

Each 1°C increase in the allowed server inlet temperature lets the CRAC setpoint rise by about 1°C, which reduces cooling energy by roughly 2–4%. Raising the setpoint from 18°C to 27°C can cut cooling energy by 15–40%.

🔌Cabling & Connectivity

Network Cable Types

Type | Max Distance | Speed | Use Case
Cat6 | 55m @ 10G | Up to 10 GbE | Short server-to-ToR runs
Cat6A | 100m @ 10G | Up to 10 GbE | Standard structured cabling
DAC (Direct Attach Copper) | 1–7m | 25G / 40G / 100G | Server-to-ToR, very short high-speed
AOC (Active Optical Cable) | Up to 100m | 25G / 100G / 400G | Cross-cage, spine interconnects
OM3/OM4 Fiber (Multi-mode) | 300m / 400m @ 10G | Up to 100G | Within same facility floor
OS2 Fiber (Single-mode) | 10+ km | Up to 400G+ | Cross-facility, long-haul backbone

Power Cable Standards

  • C13/C14 — standard IEC 60320, up to 10A/15A; most 1U–2U servers
  • C19/C20 — heavy-duty IEC 60320, up to 16A/20A; high-power servers, GPUs, storage
  • NEMA L6-20 / L6-30 — locking connectors, used for PDU feeds in North American facilities
  • IEC 60309 — industrial connectors (16A/32A), common in European colo facilities

Cable Management Principles

  • Use horizontal cable managers (1U arm) at every patch panel and switch level
  • Vertical cable trays on both sides of the rack for power and data segregation
  • Label both ends of every cable — source and destination. Use consistent scheme: RACK-RU-PORT
  • Velcro ties, not zip ties — zip ties crush cable jackets and make changes destructive
  • Color coding example: Blue = data, Red = management/OOB, Yellow = SAN/storage, Black = power
  • Separate power and data cables in different trays to minimize EMI
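
Generating labels from structured data keeps both ends of a cable consistent. The sketch below follows the RACK-RU-PORT scheme; the exact formatting (zero-padded `U` prefix, `>` between local and remote end) is an assumption chosen for illustration.

```python
from typing import NamedTuple, Tuple


class Endpoint(NamedTuple):
    rack: str   # e.g. "R07"
    ru: int     # rack unit of the device
    port: str   # e.g. "P12" or "eth0"


def end_label(ep: Endpoint) -> str:
    """Format one cable end as RACK-RU-PORT."""
    return f"{ep.rack}-U{ep.ru:02d}-{ep.port}"


def cable_labels(a: Endpoint, b: Endpoint) -> Tuple[str, str]:
    """Return the label text for each end: local end first, then remote end."""
    a_text, b_text = end_label(a), end_label(b)
    return (f"{a_text} > {b_text}", f"{b_text} > {a_text}")


if __name__ == "__main__":
    server = Endpoint("R07", 12, "eth0")
    switch = Endpoint("R07", 40, "P12")
    for text in cable_labels(server, switch):
        print(text)
```

Printing the local end first means a technician reading either label sees where they are standing, then where the cable goes.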

Operations Rule

If you cannot trace a cable's path and purpose in under 10 seconds, your cable management is failing. Fix it before an incident forces you to do it under pressure.

Fiber Optic Handling

  • Never bend fiber tighter than its minimum bend radius (typically 10× the cable diameter)
  • Use LC connectors for most server/switch connections; MPO/MTP for high-density breakout
  • Clean fiber end-faces before every insertion — a single dirty connector can degrade link performance or cause intermittent errors
  • Document fiber runs in your DCIM: port-to-port mapping, connector type, length, dB loss measurement

🖥️Server Components

Key Subsystems

Subsystem | Key Spec | Failure Signal | Monitoring Method
CPU | TDP, core count, frequency | Thermal throttle, NMI, MCE | IPMI sensors, OS metrics
RAM (DRAM) | ECC DDR5, speed, capacity | Correctable/uncorrectable ECC errors | edac-util, IPMI, vendor BMC
NVMe / SSD | DWPD, capacity, latency | SMART errors, reallocated sectors | smartctl, nvme-cli
NIC | Speed (25G/100G), port count | CRC errors, packet drops, link flap | ethtool, SNMP, DCIM
PSU | Wattage, efficiency (80 Plus) | PSU fault LED, IPMI alert | IPMI, Redfish, smart PDU
BMC/IPMI | Out-of-band management chip | Inaccessible console | Regular connectivity checks
Fans | RPM, airflow CFM | Fan fault LED, high inlet temp | IPMI sensor polling

Hot-Swap Components

These can be replaced without powering down the server (vendor-dependent — always verify):

  • PSU (when redundant)
  • Drives in a RAID array (with proper RAID rebuild procedure)
  • Fans (in most enterprise-grade servers)
  • NIC in PCIe hot-swap bays (rare; OCP 3.0 mezzanine cards on some platforms)

IPMI / BMC / Redfish

Every enterprise server has an out-of-band management interface separate from the main OS network. Vendor naming varies: iDRAC (Dell), iLO (HPE), XClarity Controller (Lenovo), and plain IPMI/BMC (Supermicro). The modern standard API is Redfish (RESTful, over HTTPS), which replaces legacy IPMI 2.0 LAN commands.

  • Always put BMC/iDRAC on a dedicated management VLAN (OOB network)
  • Set IPMI access to management network only — never expose to the internet
  • Use Redfish for programmatic provisioning and sensor polling
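
As an illustration of Redfish sensor polling, the sketch below extracts temperature readings from a Thermal resource payload. It assumes the classic `Thermal` schema with a `Temperatures` array (commonly served under `/redfish/v1/Chassis/<id>/Thermal`; newer firmware may expose `ThermalSubsystem` instead). The sample payload and sensor names are invented, and the HTTPS fetch and authentication are deliberately left out.

```python
import json
from typing import Dict

# Invented sample payload shaped like a Redfish Thermal resource.
SAMPLE_THERMAL = json.loads("""
{
  "Temperatures": [
    {"Name": "Inlet Temp", "ReadingCelsius": 24, "UpperThresholdCritical": 42},
    {"Name": "CPU1 Temp",  "ReadingCelsius": 71, "UpperThresholdCritical": 95},
    {"Name": "CPU2 Temp",  "ReadingCelsius": null, "UpperThresholdCritical": 95}
  ]
}
""")


def temperature_readings(thermal: dict) -> Dict[str, float]:
    """Map sensor name to degrees Celsius, skipping sensors with no reading."""
    return {
        sensor["Name"]: sensor["ReadingCelsius"]
        for sensor in thermal.get("Temperatures", [])
        if sensor.get("ReadingCelsius") is not None
    }


if __name__ == "__main__":
    for name, celsius in temperature_readings(SAMPLE_THERMAL).items():
        print(f"{name}: {celsius} C")
```

Sensors that are absent or powered down often report a null reading, so filtering them out avoids false alerts on empty CPU sockets.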

OOB Discipline

An unreachable BMC is a server you cannot recover without a physical site visit. Test OOB connectivity for every server at deploy time and include it in your monitoring. Down BMC = P2 incident in your runbook.

80 Plus Efficiency Standards

Certification (115V, non-redundant) | Min Efficiency @ 20% | Min Efficiency @ 50% | Min Efficiency @ 100%
80 Plus Bronze | 82% | 85% | 82%
80 Plus Gold | 87% | 90% | 87%
80 Plus Platinum | 90% | 92% | 89%
80 Plus Titanium | 92% | 94% | 90%

🌐Networking Architecture

Three-Tier vs Leaf-Spine

Traditional three-tier (access → aggregation → core) was designed for client-server traffic patterns with most traffic going north-south (user → server). It is increasingly inadequate for modern east-west traffic-heavy workloads (server-to-server, distributed systems, microservices).

Leaf-Spine Architecture (Modern Standard)

Every leaf switch connects to every spine switch. No leaf-to-leaf links. This provides predictable, low-latency, and equal-cost paths between any two servers in the fabric.

Server → ToR / Leaf ↔ [ECMP] ↔ Spine ↔ [ECMP] ↔ Border Leaf → External / WAN

  • ECMP (Equal-Cost Multi-Path) — multiple equal-cost paths are load-balanced, increasing effective bandwidth and providing automatic failover
  • BGP as the fabric underlay — eBGP is increasingly used as the routing protocol within the datacenter fabric (RFC 7938)
  • VXLAN overlay — tunneling protocol that extends Layer 2 segments over Layer 3 underlay; enables VM/workload mobility across the fabric
  • Typical oversubscription — leaf ports (server-facing) to spine uplinks: 3:1 to 6:1 for standard compute; 1:1 for latency-sensitive workloads
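
The oversubscription ratio is server-facing bandwidth divided by uplink bandwidth on a leaf. A minimal sketch, with port counts and speeds chosen purely as an example:

```python
def oversubscription_ratio(server_ports: int, server_gbps: float,
                           uplink_ports: int, uplink_gbps: float) -> float:
    """Server-facing bandwidth divided by spine-facing bandwidth for one leaf."""
    downlink = server_ports * server_gbps
    uplink = uplink_ports * uplink_gbps
    if uplink <= 0:
        raise ValueError("leaf must have at least one uplink")
    return downlink / uplink


if __name__ == "__main__":
    # Hypothetical leaf: 48 x 25G server ports, 4 x 100G uplinks.
    ratio = oversubscription_ratio(48, 25, 4, 100)
    print(f"oversubscription {ratio:g}:1")
```

With 48 × 25G server ports and 4 × 100G uplinks the result is 3:1, at the low end of the standard-compute range; doubling the uplinks to 8 brings it to 1.5:1.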

Key Protocols & Technologies

Technology | Layer | Purpose
VLANs (802.1Q) | L2 | Traffic segmentation within a switch/fabric
LACP / MLAG | L2 | Link aggregation; dual-homing servers to two ToR switches
OSPF | L3 | Interior gateway protocol; often used in smaller fabrics
BGP (eBGP) | L3 | Preferred underlay routing in large leaf-spine fabrics
VXLAN (RFC 7348) | L3 overlay | Extend L2 domains over L3 routed fabric
BFD | L3 | Sub-second failure detection for BGP/OSPF sessions
RDMA / RoCE | Transport | Low-latency networking for storage and HPC workloads

Design Principle

The network is the nervous system: design for predictable latency first, then redundancy, then throughput. A flapping link is worse than a link that is consistently down, because flapping causes intermittent application errors that are hard to diagnose.

🔐Physical & Logical Security

Physical Security Layers

  • Cage locks — electronic locks with audit logging; dual-factor (badge + PIN or biometric) for critical cages
  • Access logs — every cage entry/exit must be logged with timestamp and identity; retain for 90+ days minimum, often 12 months for compliance
  • Cameras — minimum coverage: cage entrance, all rack fronts. Motion-triggered recording; retain 30–90 days of footage
  • Escort policies — vendor technicians and colo staff must be escorted by your personnel; never allow unescorted access to your cage
  • Tamper-evident labels — on server panels and drive bays to detect unauthorized component access
  • Asset tagging — RFID or barcode on every device; reconcile against DCIM inventory quarterly

Logical Security

  • Dedicated management network (OOB) — BMC/iDRAC on a separate VLAN/subnet, isolated from production traffic, accessible only via jump host
  • Jump hosts (bastions) — all SSH/HTTPS access to servers routed through hardened bastion hosts with MFA and full session logging
  • Network segmentation — firewall between production, management, storage, and public network zones
  • Firmware / BIOS passwords — prevent unauthorized boot device changes or BIOS configuration
  • Secure boot — enabled on all servers to prevent boot-time malware
  • Drive encryption — full-disk encryption (AES-256) on all drives, especially in shared or multi-tenant environments

Zero Trust Principles

Apply zero-trust concepts even within the physical cage:

  • No implicit trust based on physical location (being inside the cage does not grant network access)
  • Authenticate every access — network, management plane, and physical
  • Least-privilege access — operators have access to only the racks they manage
  • Audit everything — access logs, CLI session recordings, change tickets

Compliance Note

Many frameworks (SOC 2, PCI-DSS, ISO 27001) require documented physical access controls, audit logs, and quarterly access reviews. Build these processes before your first audit, not during it.

📊Monitoring & Observability

Facility-Level Monitoring

  • PDU power draw — per-outlet kW, amps, power factor; alert on >80% circuit utilization
  • Temperature & humidity — rack-level sensors at U1 (inlet) and top-of-rack (exhaust); alert on inlet >27°C
  • Airflow — differential pressure sensors on cold aisle containment
  • UPS status — battery health, bypass status, runtime remaining
  • Generator test status: last tested, fuel level (facility-provided, but monitor it)

Hardware-Level Monitoring

  • CPU temperature — via IPMI/Redfish; alert if CPU package temp >70°C sustained
  • Memory ECC errors — correctable ECC errors are warning; uncorrectable = critical, immediate replacement
  • Disk health — SMART attributes (reallocated sectors, pending sectors, uncorrectable errors); NVMe wear indicators
  • Fan RPM — alert on fan failure or RPM significantly below expected
  • PSU status — fault/OK per PSU via IPMI
  • NIC errors — CRC errors, input errors, drops; alert on sustained non-zero error rates
  • BMC connectivity — alert if OOB/IPMI unreachable (cannot manage the server remotely)
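
The hardware signals above can be folded into a single severity per server. In this sketch only the ECC mapping (correctable is a warning, uncorrectable is critical) and the 70°C CPU threshold come from the text; the field names and the severity chosen for PSU, fan, and BMC conditions are assumptions to adapt to your own runbook.

```python
CPU_TEMP_WARN_C = 70  # sustained CPU package temperature threshold from the text


def classify_server(sample: dict) -> str:
    """Return 'critical', 'warning', or 'ok' for one server's health sample.

    Missing fields are treated as healthy.
    """
    critical = (
        sample.get("ecc_uncorrectable", 0) > 0
        or sample.get("psu_fault", False)
        or sample.get("fan_fault", False)
    )
    if critical:
        return "critical"

    warning = (
        sample.get("ecc_correctable", 0) > 0
        or sample.get("cpu_temp_c", 0) > CPU_TEMP_WARN_C
        or not sample.get("bmc_reachable", True)
    )
    return "warning" if warning else "ok"


if __name__ == "__main__":
    print(classify_server({"cpu_temp_c": 64, "ecc_correctable": 3}))
```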

Observability Stack (Common)

Layer | Open Source | Commercial
Metrics collection | Prometheus + node_exporter, IPMI exporter, Telegraf | Datadog Agent
Visualization | Grafana | Datadog Dashboards, Splunk
Alerting | Alertmanager, PagerDuty (integration) | Datadog Alerts, OpsGenie
DCIM / Inventory | NetBox, OpenDCIM | Nlyte, Sunbird, Device42
Log aggregation | Loki + Promtail, OpenSearch | Datadog Logs, Splunk

Per-Rack Dashboard (Recommended)

Build a per-rack view showing: power draw (A + B feeds), inlet temp, outlet temp, live alerts, hardware fault count, and top-consuming servers. This lets an on-call engineer assess rack health at a glance during an incident without needing multiple tool windows.

🤖Automation & Source of Truth

The Automation Stack

Tool | Role | What It Manages
NetBox | Source of Truth / DCIM | Inventory, IP addresses, VLANs, rack layouts, cables
MAAS | Bare metal provisioning | PXE boot, OS install, cloud-init configuration
Ansible | Configuration management | OS hardening, package state, service config, idempotent enforcement
Terraform | Infrastructure as Code | Cloud resources, network device config (via providers)
Foreman/Satellite | Lifecycle management | Provisioning + configuration + patch management (alternative to MAAS+Ansible)

Ideal Provisioning Flow

1 | 📝 NetBox | Define rack, device, IP, and VLAN in NetBox as the record of intent
2 | Physical Install | Rack the server, connect power (A+B), connect OOB and data NICs, label everything
3 | 🔄 MAAS / PXE | Server boots via PXE, MAAS commissions it, installs the OS, and applies cloud-init configuration
4 | 🔧 Ansible | Runs playbooks to enforce desired state: packages, users, sysctl, monitoring agents
5 | 📊 Monitoring | Node auto-registered in Prometheus/Datadog via service discovery; alerts active

Automation Principle

If you do it more than twice manually, automate it. If it takes more than 30 minutes to provision a new server, your automation is incomplete. Target: zero-touch provisioning from power-on to production-ready.

NetBox as Source of Truth

NetBox should be the authoritative record for: every rack, every device, every IP address, every VLAN, and every cable. All automation tooling reads from NetBox as its source of truth, not from ad-hoc scripts or spreadsheets. Treat NetBox updates as part of the change process — no physical change without a NetBox update.
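
Keeping NetBox authoritative means regularly diffing it against what is physically installed. The sketch below compares two plain mappings of serial number to (rack, starting U); exporting those mappings from NetBox and from a floor walk is assumed to happen elsewhere, and `reconcile` is a hypothetical helper rather than part of any NetBox client.

```python
from typing import Dict, List, Tuple

Location = Tuple[str, int]  # (rack name, starting U)


def reconcile(netbox: Dict[str, Location],
              observed: Dict[str, Location]) -> Dict[str, List[str]]:
    """Diff recorded vs observed device locations, keyed by serial number."""
    recorded, seen = set(netbox), set(observed)
    return {
        "missing": sorted(recorded - seen),      # in NetBox, not on the floor
        "unrecorded": sorted(seen - recorded),   # on the floor, not in NetBox
        "moved": sorted(s for s in recorded & seen
                        if netbox[s] != observed[s]),
    }


if __name__ == "__main__":
    # Invented serial numbers and locations for illustration.
    netbox = {"SN001": ("R01", 10), "SN002": ("R01", 12), "SN003": ("R02", 5)}
    floor = {"SN001": ("R01", 10), "SN002": ("R01", 14), "SN004": ("R02", 8)}
    print(reconcile(netbox, floor))
```

Each of the three result lists maps to a different follow-up: missing devices need an investigation, unrecorded ones need a NetBox entry, and moved ones indicate a change made without a record update.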

🔥Failure Domains & Risk Management

Failure Mode Matrix

Layer | Failure Mode | Impact | Detection | Mitigation
Facility Power | Utility grid failure | Total DC outage | UPS on-battery alert | Generator, UPS, 2N feeds
Rack Power | PDU circuit trip | Single rack partial/full loss | PDU alert, server offline | Dual PDU, A+B feeds, 80% rule
Cooling | CRAC unit failure | Rising inlet temps, thermal throttle | Temperature sensors, alerts | N+1 CRAC units, hot-aisle containment
Cooling | Airflow blockage | Hot spots, localized throttle | Rack inlet sensor spike | Blanking panels, cable management
Network | ToR switch failure | Rack network loss | BGP/OSPF neighbor drop | MLAG dual-homing to two ToR switches
Network | Spine switch failure | Fabric partial capacity | ECMP path count drop | N+2 spine switches, fast failover
Hardware | Disk failure | Data degradation (RAID), potential data loss | SMART alerts, RAID events | RAID 10/6, prompt replacement, spare pool
Hardware | PSU failure | Potential server shutdown | IPMI PSU fault alert | Dual PSU on A+B feeds
Human | Wrong cable unplugged | Unexpected outage | Network/server alert | Clear labeling, change management, lockout tags
Human | Unauthorized access | Security breach, data theft | Access log anomaly, CCTV | Dual-factor, escort policy, audit logs

Failure Domain Design

A failure domain is the blast radius of a single failure. The goal is to minimize the blast radius by isolating components at appropriate granularity:

  • Per-server — dual PSU, redundant NICs
  • Per-rack — dual PDUs on separate feeds; dual ToR switches via MLAG
  • Per-row — in-row cooling, separate power circuit
  • Per-cage — dedicated cage power feeds, separate cooling zone
  • Per-datacenter — geographic redundancy, site-to-site replication

Common Pitfall

A/B power redundancy only works if the A and B feeds are truly independent — separate utility feeds, separate UPS systems, separate PDU breakers. If both feeds share a single UPS or a single upstream breaker, you have the appearance of redundancy without the reality.
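
Feed independence can be checked mechanically if each power path is documented as a list of upstream components. This is a minimal sketch; the component names are invented, and in practice the lists would come from the colo provider's single-line diagram.

```python
from typing import List


def shared_components(feed_a: List[str], feed_b: List[str]) -> List[str]:
    """Return upstream components that appear in both power paths."""
    return sorted(set(feed_a) & set(feed_b))


if __name__ == "__main__":
    # Hypothetical paths from utility down to the rack PDU.
    feed_a = ["utility-1", "ats-1", "ups-a", "pdu-a1"]
    feed_b = ["utility-1", "ats-2", "ups-b", "pdu-b1"]
    overlap = shared_components(feed_a, feed_b)
    if overlap:
        print("single points of failure:", ", ".join(overlap))
    else:
        print("feeds are independent")
```

In this example the two feeds look separate at the UPS and PDU level but share one utility feed, so a utility outage takes both onto battery at the same time.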

🧾Operational Checklists

Daily Checks

  • Review active alerts (power, temperature, network, hardware) — triage any P2+ alerts
  • Verify UPS status — all units on utility, no battery-only conditions
  • Check temperature dashboard — no rack inlets above 27°C
  • Verify all critical servers are reachable (ping / OOB connectivity)
  • Review access logs for unexpected cage entry events

Weekly Checks

  • Review power capacity — no circuit above 70% average utilization
  • Review network bandwidth utilization — no sustained link above 70%
  • Check disk health dashboard — any drives with elevated SMART errors
  • Verify cable integrity — no loose or unlabeled cables observed during any site visit
  • Review OOB (BMC/IPMI) reachability — all nodes responding

Monthly Checks

  • Audit physical inventory vs NetBox — walk the floor, reconcile discrepancies
  • Test failover paths — intentionally fail one PDU feed and verify A/B redundancy
  • Review and rotate credentials — PDU admin passwords, BMC accounts
  • Check spare parts inventory — drives, PSUs, cables, transceivers
  • Review access list — remove departed personnel

Quarterly Checks

  • Disaster recovery simulation — full tabletop or live failover test
  • Firmware updates — server BIOS, BMC/iDRAC, NIC firmware, switch OS
  • Capacity planning review — project 6-month power, space, and network growth
  • Review and update runbooks — ensure documentation reflects current architecture
  • Compliance access review — formal review of all physical and logical access rights

🧠Advanced Concepts

Capacity Planning

Capacity planning is the practice of ensuring your cage has sufficient headroom — in power, cooling, space, and network — to accommodate planned growth without emergency procurement. Key metrics:

  • Power budget per rack — track committed vs available kW per circuit; alert when committed reaches 70% of rated capacity
  • Network oversubscription ratios: server-facing port bandwidth ÷ leaf uplink bandwidth. Typical: 4:1 (standard compute), 1:1 or 2:1 (latency-sensitive or storage)
  • Space (U) utilization — per-rack and per-cage tracking; maintain 30% headroom for emergency rack swaps
  • Cooling envelope — calculate actual heat load vs CRAC/CRAH rated capacity; target <70% utilization of cooling capacity

High-Density / AI Racks

GPU and AI accelerator racks (NVIDIA DGX, AMD Instinct clusters) present fundamentally different engineering challenges than standard compute:

Power Density | 20–100 kW | Per rack; vs 5–15 kW standard
Cooling Method | DLC / Rear-door | Air cooling insufficient above ~25 kW/rack
Network | 200G–400G HDR/NDR | InfiniBand or 400GbE for GPU fabric
Floor Load | 1200+ kg/rack | Verify structural floor capacity

Cost Awareness

Cost Category | Type | Typical Range | Optimization Lever
Colo space (per cabinet/mo) | OPEX | $800–$3000/cabinet | Density optimization, right-sizing
Power (per kW/mo committed) | OPEX | $50–$150/kW | Improve PUE, decommission idle servers
Cross-connect (per port/mo) | OPEX | $300–$1000/port | Consolidate carriers, use transit IX
Server hardware | CAPEX | $3k–$50k/server | Bulk procurement, lease vs buy analysis
Network hardware | CAPEX | $5k–$200k/switch | Whitebox alternatives, open networking

Reliability Engineering

  • MTBF (Mean Time Between Failures) — average time between hardware failures. Higher MTBF = more reliable hardware. Use vendor-published MTBF as a planning input, not a guarantee.
  • MTTR (Mean Time To Repair/Restore) — average time to restore service after failure. This is the metric you control operationally: good runbooks, spare parts on-site, and practiced procedures reduce MTTR.
  • Availability = MTBF ÷ (MTBF + MTTR). A device with 10,000h MTBF and 2h MTTR has 99.98% availability.
  • SLOs for infrastructure — define and track SLOs for power availability, network uptime, and provisioning time. Without SLOs, you cannot tell if your infrastructure is improving.

Reliability Insight

MTTR matters more than MTBF at scale. In a fleet of 1,000 servers, hardware failures are a routine, often weekly, event even when individual components have a high MTBF. The question is not whether something fails but how fast you detect and recover. Invest in detection and automation proportionally to your fleet size.
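
The availability formula and the fleet-scale effect can both be checked with a few lines. The fleet estimate assumes independent failures at a constant rate, which is a simplification; the 10,000h MTBF is the worked figure from the text, applied here to a whole server for illustration.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_failures_per_week(fleet_size: int, mtbf_hours: float) -> float:
    """Expected weekly failures across a fleet, assuming a constant failure rate."""
    return fleet_size * HOURS_PER_WEEK / mtbf_hours


if __name__ == "__main__":
    print(f"availability: {availability(10_000, 2):.4%}")
    print(f"weekly failures in a 1,000-server fleet: "
          f"{expected_failures_per_week(1_000, 10_000):.1f}")
    # Halving MTTR improves availability without touching the hardware.
    print(f"with 1h MTTR: {availability(10_000, 1):.4%}")
```

Under these assumptions a 1,000-server fleet sees roughly 17 failures per week, which is why detection and repair speed dominate the operational picture.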