Datacenter Cage Engineering Guide

Section 01

🧠 Mental Model: What You're Managing

A datacenter cage is a locked, dedicated enclosure within a larger multi-tenant colocation facility. Colocation (colo) providers sell space, power, and connectivity — you own the hardware. The cage is your controlled micro-environment inside their building.

Think of it as a system with five distinct but interdependent layers. Failures in lower layers cascade upward; over-provisioning in one layer cannot compensate for a bottleneck in another.

Layer	Name	Scope
L5	🤖 Control Layer	Monitoring, automation, access control, runbooks
L4	🖥️ System Layer	Servers, storage, networking hardware
L3	🗄️ Rack Layer	Physical organization, U-space, cable management
L2	🔒 Cage Layer	Physical security boundary, access logging
L1	⚡ Facility Layer	Utility power, UPS, generators, CRAC/CRAH cooling, fire suppression

Core Mandate

Ensure availability, safety, performance, and traceability across all five layers. Your job is to ensure no bottleneck, no failure, no silent degradation.

SRE Mapping

If you think in pipelines, the datacenter maps cleanly:

⚡ Power → 🌬️ Cooling → 🌐 Network → 🖥️ Compute → 📊 Observability

Each "pipeline" has an input, a capacity, a failure mode, and a monitoring signal. Treat them accordingly.

Section 02

🧱 Cage Design & Layout

A cage is typically a welded or bolted mesh enclosure (floor-to-ceiling or raised) within a shared colo floor. Solid-wall cages offer more security; mesh cages allow easier visual inspection and airflow monitoring. Most enterprise deployments use mesh with solid-panel overlays for higher-security zones.

Core Design Concepts

Cold aisle / hot aisle containment — servers face the cold aisle (cool intake air) and exhaust into the hot aisle. Containment systems prevent mixing of cold and hot air streams, which markedly improves cooling efficiency (typically lowering PUE by roughly 0.1–0.2).
Density planning (kW per rack) — standard enterprise racks run 5–15 kW; GPU/AI racks can reach 20–80 kW. Know your density before placing hardware: exceeding the density the cooling design supports causes hot spots and thermal throttling.
Growth planning — maintain 30–50% free rack space and 20–30% free power headroom at all times. Running at 90%+ capacity removes your buffer for planned maintenance or emergency rerouting.

Physical Components

Cage Access

Biometric + PIN

Badge alone insufficient for Tier III/IV

Cameras

≥4 angles

Cover all rack fronts + cage entrance

Floor Type

Raised or Slab

Raised allows under-floor cable routing

Free Capacity

30–50%

Headroom for growth & maintenance

Layout Principles

Centralize network racks to minimize cable runs and latency
Separate compute, storage, and network into distinct rack zones
Isolate high-density GPU/AI racks — they require dedicated cooling circuits
Keep critical infrastructure (PDUs, OOB switches) accessible without disturbing production racks

Layout Rule

Minimize cable length, maximize airflow, and isolate failure domains. Every layout decision is a trade-off between these three.

Uptime Institute Tier Classification

Colo facilities are rated Tier I–IV. Your cage's reliability ceiling is determined by the facility tier.

Tier	Redundancy	Uptime SLA	Annual Downtime
Tier I	No redundancy	99.671%	≤28.8 hrs
Tier II	Redundant components	99.741%	≤22.7 hrs
Tier III	N+1, concurrently maintainable	99.982%	≤1.6 hrs
Tier IV	2N, fault tolerant	99.995%	≤26.3 min
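The Annual Downtime column follows directly from the Uptime SLA column. A minimal sketch of that arithmetic, assuming a non-leap year of 8,760 hours (the function name is illustrative, not from any standard):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def annual_downtime_hours(availability_pct: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for tier, pct in [("Tier I", 99.671), ("Tier III", 99.982), ("Tier IV", 99.995)]:
        hours = annual_downtime_hours(pct)
        print(f"{tier}: {hours:.1f} h/year ({hours * 60:.1f} min)")
```

The same function is useful when a colo contract quotes an SLA that does not match one of the four tier figures.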

Section 03

🗄️ Rack Engineering

Standard Rack Specifications

Height

42U – 52U

1U = 1.75 inches (44.45 mm)

Width

19-inch

EIA-310 standard; outer width ~600mm

Depth

1000–1200mm

Deep racks needed for modern servers (800–900mm depth)

Max Load

900–1360 kg

Verify floor load rating before placing full racks

Typical Density

5–15 kW

Standard compute; GPU racks up to 80 kW

Air Direction

Front → Back

Standardize across all equipment

Vertical Rack Layout (Top → Bottom)

This ordering optimizes airflow, cable management, and center of gravity:

Position	Component	Notes
Top	🔌 Patch panels / Fiber trays	Structured cabling termination; keep accessible
↓	🌐 Top-of-Rack (ToR) switches	1–4U; short runs to all servers below
↓	💨 Lightweight / 1U servers	Management, edge, or utility nodes
Bottom	🖥️ Heavy compute / storage servers	Lower position = lower center of gravity = seismic safety
Side	⚡ PDUs (vertical, side-mounted)	One PDU-A (Feed A), one PDU-B (Feed B) per rack

Common Mistake

Never leave unused U-spaces open. Air leaking through empty slots can reduce cooling efficiency by 10–30%, so fit a blanking panel over every unused U.

U-Space Planning

Track U-space in your DCIM or CMDB (e.g., NetBox). Each device consumes a specific U count. Common examples:

Device Type	Typical U	Example
1U server	1U	Dell PowerEdge R650, Supermicro 1029U
2U server	2U	Dell R750, HPE ProLiant DL380 Gen10
4U–6U GPU server	4–6U	NVIDIA DGX A100 (6U), Supermicro SYS-420GP (4U)
ToR switch	1U	Arista 7050X3, Cisco Nexus 93180YC
Patch panel (24-port)	1U	Leviton, Panduit
KVM switch	1U	Raritan Dominion

Section 04

⚡ Power Systems

Power is the most critical physical resource in a cage. A power failure cascades instantly to every system; no redundancy elsewhere compensates for lost power. Understand every step in the power chain.

Power Chain

Utility Grid → Transformer → ATS/STS → UPS → PDU → Server PSU

ATS = Automatic Transfer Switch; STS = Static Transfer Switch; UPS = Uninterruptible Power Supply; PDU = Power Distribution Unit

Redundancy Model

Enterprise standard is A/B power feeds — two completely independent paths from separate utility feeds, separate UPS systems, separate PDUs, to dual PSUs in each server.

Redundancy Level

2N (Tier IV)

Full duplication — most resilient

Standard Enterprise

N+1

One extra UPS/PDU per circuit

PSU Config

Dual PSU

One on Feed A, one on Feed B

UPS Runtime

10–30 min

Bridge until generator starts (<30 sec)

Hard Rule

Never exceed 80% of any circuit's rated capacity (NEC 80% rule for continuous loads). A 20A circuit = 16A usable. A 30A circuit = 24A usable. Sustained loads above 80% risk breaker trips and thermal damage.
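The 80% rule is easy to encode as a guard in capacity tooling. The sketch below is illustrative only: the function names are made up, it treats the circuit as single-phase, and the 208 V figure in the usage note is an assumed example voltage.

```python
def usable_amps(breaker_amps: float, continuous_factor: float = 0.8) -> float:
    """Continuous-load current allowed on a breaker under the 80% rule."""
    return breaker_amps * continuous_factor


def usable_va(breaker_amps: float, volts: float, continuous_factor: float = 0.8) -> float:
    """Apparent power (VA) available on a single-phase circuit after derating."""
    return usable_amps(breaker_amps, continuous_factor) * volts


if __name__ == "__main__":
    for amps in (20, 30):
        print(f"{amps}A breaker -> {usable_amps(amps):.0f}A usable, "
              f"{usable_va(amps, 208) / 1000:.2f} kVA at 208 V")
```

Under these assumptions a 30A circuit at 208 V carries about 4.99 kVA of continuous load, which is the figure to compare against the summed nameplate or measured draw of everything plugged into it.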

Power Concepts

kW vs kVA

kVA is apparent power (what the PDU is rated for); kW is real power (what the server actually consumes). The ratio is the power factor (PF). Modern server PSUs typically have PF ≥ 0.95, so kW ≈ kVA × 0.95. Always plan budgets in kW (actual consumption), but size circuits in kVA (what the PDU/breaker must handle).
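As a quick illustration of the kW and kVA relationship, the following sketch converts in both directions. The default power factor of 0.95 comes from the paragraph above; the function names are hypothetical.

```python
def kva_to_kw(kva: float, power_factor: float = 0.95) -> float:
    """Real power (kW) delivered by a given apparent power (kVA)."""
    return kva * power_factor


def kw_to_kva(kw: float, power_factor: float = 0.95) -> float:
    """Apparent power (kVA) a circuit must carry to deliver a real load (kW)."""
    if not 0 < power_factor <= 1:
        raise ValueError("power factor must be in the range (0, 1]")
    return kw / power_factor


if __name__ == "__main__":
    print(f"5 kVA PDU at PF 0.95 supplies about {kva_to_kw(5.0):.2f} kW")
    print(f"A 4 kW rack at PF 0.95 needs about {kw_to_kva(4.0):.2f} kVA")
```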

Phase Balancing

Three-phase power is standard in datacenters. Distribute load evenly across phases (L1, L2, L3) to avoid overloading a single phase. Target imbalance < 10% between phases. Measure at the PDU breaker level.
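One way to put a number on phase imbalance is the maximum deviation from the mean phase current, expressed as a percentage of the mean. That definition is an assumption here (other conventions exist), and the function name is illustrative.

```python
def phase_imbalance_pct(amps_l1: float, amps_l2: float, amps_l3: float) -> float:
    """Maximum deviation from the average phase current, as a percentage."""
    loads = (amps_l1, amps_l2, amps_l3)
    average = sum(loads) / len(loads)
    if average == 0:
        return 0.0  # an unloaded PDU is trivially balanced
    return max(abs(amps - average) for amps in loads) / average * 100


if __name__ == "__main__":
    reading = (12.0, 10.0, 8.0)  # example per-phase amps read from a smart PDU
    pct = phase_imbalance_pct(*reading)
    status = "OK" if pct < 10 else "rebalance"
    print(f"imbalance {pct:.1f}% -> {status}")
```

With this definition, the example reading of 12A, 10A, and 8A is 20% out of balance, well above the 10% target.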

Power Measurement Tools

Smart PDUs (e.g., Raritan PX3, APC AP8900 series) — per-outlet monitoring, remote switching
DCIM platforms (e.g., Nlyte, Sunbird) — aggregate power dashboards
IPMI/BMC — per-server power consumption via Redfish or IPMI 2.0

PUE (Power Usage Effectiveness)

PUE = Total Facility Power ÷ IT Equipment Power. A PUE of 1.0 is theoretical perfection; modern facilities target 1.2–1.4. A PUE of 2.0 means as much energy is wasted on overhead (cooling, lighting) as powers IT gear — unacceptable by today's standards.
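The PUE formula and the overhead it implies can be sketched as follows. The kW figures in the usage example are made-up inputs, not measurements.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power over IT power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_kw(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power spent on cooling, lighting, and distribution losses."""
    return total_facility_kw - it_equipment_kw


if __name__ == "__main__":
    total, it_load = 1300.0, 1000.0  # hypothetical readings in kW
    print(f"PUE {pue(total, it_load):.2f}, overhead {overhead_kw(total, it_load):.0f} kW")
```

At a PUE of 2.0 the overhead equals the IT load, which is the case the paragraph above calls unacceptable.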

Section 05

🌬️ Cooling & Airflow

Thermal management is the second critical physical resource. Servers tolerate brief power interruption via UPS; they tolerate almost no thermal excursion — CPU throttling begins at ~70°C and emergency shutdown typically triggers at 85–95°C.

Cooling Models

Model	How It Works	Best For	Limitation
Cold Aisle Containment (CAC)	Enclose the cold aisle; servers draw cool air from inside the containment	Standard compute, 5–15 kW/rack	Hot exhaust enters open data hall
Hot Aisle Containment (HAC)	Enclose the hot aisle; hot air is captured and returned directly to CRACs	Higher efficiency than CAC, same density	Hot aisle is inaccessible during operation
In-Row Cooling	Cooling units placed between racks; cool air delivered at row level	High-density rows, GPU clusters	Higher CAPEX per kW cooled
Rear-Door Heat Exchanger	Chilled water coil in rack rear door absorbs server exhaust directly	Very high density (15–40 kW/rack)	Requires chilled water plumbing per rack
Direct Liquid Cooling (DLC)	Cold plates on CPUs/GPUs with liquid coolant; near-zero air cooling	AI/ML racks, 40–100+ kW/rack	Complex plumbing, leak risk, higher cost

ASHRAE Thermal Guidelines

ASHRAE TC 9.9 defines environmental envelopes for IT equipment:

Class A1 Inlet Temp

15–32°C

Recommended: 18–27°C

Class A2 Inlet Temp

10–35°C

Most modern servers qualify

Humidity (RH)

20–80%

Non-condensing; low RH = ESD risk

Dew Point

5.5–15°C

ASHRAE recommended envelope (pre-2015 editions); critical for condensation prevention

Best Practices

Blanking panels — fill every unused rack U. Without them, cold air short-circuits from cold aisle through the rack to hot aisle without cooling anything
Cable cutout seals — use brush strips or grommets on raised floor cutouts to prevent hot-cold air mixing under-floor
Airflow monitoring — deploy temperature sensors at rack inlet (U1–U3) and outlet (top of rack). Alert on inlet >27°C or delta T >15°C
CRAC vs CRAH — CRAC (Computer Room Air Conditioner) uses DX refrigerant; CRAH (Computer Room Air Handler) uses chilled water. CRAH is more efficient at scale, requires chiller plant
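The airflow-monitoring thresholds above (inlet above 27°C, delta T above 15°C) can be expressed as a small check. This is a sketch with hypothetical names; in practice the readings would come from rack sensors or a DCIM feed.

```python
from typing import List

INLET_MAX_C = 27.0    # inlet alert threshold from the best practices above
DELTA_T_MAX_C = 15.0  # allowed outlet-minus-inlet temperature rise


def rack_thermal_alerts(inlet_c: float, outlet_c: float) -> List[str]:
    """Return alert messages for one rack's inlet and outlet readings."""
    alerts = []
    if inlet_c > INLET_MAX_C:
        alerts.append(f"inlet {inlet_c:.1f}C exceeds {INLET_MAX_C:.0f}C")
    delta_t = outlet_c - inlet_c
    if delta_t > DELTA_T_MAX_C:
        alerts.append(f"delta T {delta_t:.1f}C exceeds {DELTA_T_MAX_C:.0f}C")
    return alerts


if __name__ == "__main__":
    print(rack_thermal_alerts(24.0, 35.0))  # healthy rack: no alerts
    print(rack_thermal_alerts(28.5, 45.0))  # both thresholds exceeded
```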

Efficiency Tip

Each 1°C increase in the server inlet (supply air) setpoint reduces cooling energy by roughly 2–4%. Raising the setpoint from 18°C to 27°C can therefore cut cooling energy by roughly 15–40%, provided inlet temperatures stay inside the ASHRAE recommended range.

Section 06

🔌 Cabling & Connectivity

Network Cable Types

Type	Max Distance	Speed	Use Case
Cat6	55m @ 10G	Up to 10 GbE	Short server-to-ToR runs
Cat6A	100m @ 10G	Up to 10 GbE	Standard structured cabling
DAC (Direct Attach Copper)	1–7m	25G / 40G / 100G	Server-to-ToR, very short high-speed
AOC (Active Optical Cable)	Up to 100m	25G / 100G / 400G	Cross-cage, spine interconnects
OM3/OM4 Fiber (Multi-mode)	300m / 400m @ 10G	Up to 100G	Within same facility floor
OS2 Fiber (Single-mode)	10+ km	Up to 400G+	Cross-facility, long-haul backbone

Power Cable Standards

C13/C14 — standard IEC 60320 coupler, rated 10A (IEC) / 15A (North America); most 1U–2U servers
C19/C20 — heavy-duty IEC 60320 coupler, rated 16A (IEC) / 20A (North America); high-power servers, GPUs, storage
NEMA L6-20 / L6-30 — locking connectors, used for PDU feeds in North American facilities
IEC 60309 — industrial connectors (16A/32A), common in European colo facilities

Cable Management Principles

Use horizontal cable managers (1U arm) at every patch panel and switch level
Vertical cable trays on both sides of the rack for power and data segregation
Label both ends of every cable — source and destination. Use consistent scheme: RACK-RU-PORT
Velcro ties, not zip ties — zip ties crush cable jackets and make changes destructive
Color coding example: Blue = data, Red = management/OOB, Yellow = SAN/storage, Black = power
Separate power and data cables in different trays to minimize EMI
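A consistent RACK-RU-PORT label is easiest to enforce when tooling both generates and validates it. The exact format below (for example `R12-U07-eth1`) is an assumption for illustration; adapt the pattern to whatever scheme your site uses.

```python
import re
from typing import NamedTuple


class CableEnd(NamedTuple):
    rack: str       # rack identifier, e.g. "R12" (hypothetical naming)
    rack_unit: int  # U position of the device
    port: str       # port name on the device


# Assumed label layout: <RACK>-U<RU>-<PORT>
LABEL_RE = re.compile(r"^(?P<rack>[A-Z0-9]+)-U(?P<ru>\d{1,2})-(?P<port>[A-Za-z0-9/]+)$")


def format_label(end: CableEnd) -> str:
    """Render one cable end in the RACK-RU-PORT scheme."""
    return f"{end.rack}-U{end.rack_unit:02d}-{end.port}"


def parse_label(label: str) -> CableEnd:
    """Parse a label back into its parts, rejecting malformed labels."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"label does not follow RACK-RU-PORT: {label!r}")
    return CableEnd(match["rack"], int(match["ru"]), match["port"])


if __name__ == "__main__":
    source = CableEnd("R12", 7, "eth1")
    destination = CableEnd("R12", 42, "Eth1/7")
    print(format_label(source), "<->", format_label(destination))
```

Printing both ends on each label means either end of the cable tells you where the other end lands.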

Operations Rule

If you cannot trace a cable's path and purpose in under 10 seconds, your cable management is failing. Fix it before an incident forces you to do it under pressure.

Fiber Optic Handling

Never bend fiber tighter than its minimum bend radius (typically 10× the cable diameter)
Use LC connectors for most server/switch connections; MPO/MTP for high-density breakout
Clean fiber end-faces before every insertion — a single dirty connector can degrade link performance or cause intermittent errors
Document fiber runs in your DCIM: port-to-port mapping, connector type, length, dB loss measurement

Section 07

🖥️ Server Components

Key Subsystems

Subsystem	Key Spec	Failure Signal	Monitoring Method
CPU	TDP, core count, frequency	Thermal throttle, NMI, MCE	IPMI sensors, OS metrics
RAM (DRAM)	ECC DDR5, speed, capacity	Correctable/uncorrectable ECC errors	edac-util, IPMI, vendor BMC
NVMe / SSD	DWPD, capacity, latency	SMART errors, reallocated sectors	smartctl, nvme-cli
NIC	Speed (25G/100G), port count	CRC errors, packet drops, link flap	ethtool, SNMP, DCIM
PSU	Wattage, efficiency (80 Plus)	PSU fault LED, IPMI alert	IPMI, Redfish, smart PDU
BMC/IPMI	Out-of-band management chip	Inaccessible console	Regular connectivity checks
Fans	RPM, airflow CFM	Fan fault LED, high inlet temp	IPMI sensor polling

Hot-Swap Components

These can be replaced without powering down the server (vendor-dependent — always verify):

PSU (when redundant)
Drives in a RAID array (with proper RAID rebuild procedure)
Fans (in most enterprise-grade servers)
NIC in PCIe hot-swap bays (rare; OCP 3.0 mezzanine cards on some platforms)

IPMI / BMC / Redfish

Every enterprise server has an out-of-band management interface separate from the main OS network. Vendor naming varies: iDRAC (Dell), iLO (HPE), XClarity Controller (Lenovo), and a generic IPMI/BMC (Supermicro). The modern standard API is Redfish (RESTful, JSON over HTTPS), which replaces legacy IPMI 2.0 LAN commands.

Always put BMC/iDRAC on a dedicated management VLAN (OOB network)
Set IPMI access to management network only — never expose to the internet
Use Redfish for programmatic provisioning and sensor polling
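For sensor polling, a Redfish `Thermal` resource (commonly served under a chassis path such as `/redfish/v1/Chassis/<id>/Thermal`) returns a `Temperatures` array whose entries carry `Name` and `ReadingCelsius`. The sketch below only parses such a payload; the sample data and the 70°C limit are illustrative, the HTTP fetch and authentication are left out, and newer BMC firmware may expose the `ThermalSubsystem` resource instead.

```python
from typing import List, Tuple


def hot_sensors(thermal_payload: dict, limit_c: float = 70.0) -> List[Tuple[str, float]]:
    """Return (sensor name, reading) pairs above limit_c from a Thermal payload."""
    hot = []
    for sensor in thermal_payload.get("Temperatures", []):
        reading = sensor.get("ReadingCelsius")
        # Absent or disabled sensors report null; skip them rather than alert.
        if reading is not None and reading > limit_c:
            hot.append((sensor.get("Name", "unknown"), float(reading)))
    return hot


# Made-up payload shaped like a Redfish Thermal response (not real output).
SAMPLE_THERMAL = {
    "Temperatures": [
        {"Name": "CPU1 Temp", "ReadingCelsius": 74},
        {"Name": "Inlet Temp", "ReadingCelsius": 23},
        {"Name": "CPU2 Temp", "ReadingCelsius": None},
    ]
}

if __name__ == "__main__":
    for name, reading in hot_sensors(SAMPLE_THERMAL):
        print(f"ALERT {name}: {reading:.0f}C")
```

Keeping the parser separate from the HTTP call makes it straightforward to unit-test against captured payloads from each vendor's BMC.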

OOB Discipline

An unreachable BMC is a server you cannot recover without a physical site visit. Test OOB connectivity for every server at deploy time and include it in your monitoring. Down BMC = P2 incident in your runbook.

80 Plus Efficiency Standards

Certification	Min Efficiency @ 20%	Min Efficiency @ 50%	Min Efficiency @ 100%
80 Plus Bronze	82%	85%	82%
80 Plus Gold	87%	90%	87%
80 Plus Platinum	90%	92%	89%
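80 Plus Titanium	92%	94%	90%
Figures are the 115 V non-redundant thresholds; Titanium additionally requires 90% efficiency at 10% load.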

Section 08

🌐 Networking Architecture

Three-Tier vs Leaf-Spine

Traditional three-tier (access → aggregation → core) was designed for client-server traffic patterns with most traffic going north-south (user → server). It is increasingly inadequate for modern east-west traffic-heavy workloads (server-to-server, distributed systems, microservices).

Leaf-Spine Architecture (Modern Standard)

Every leaf switch connects to every spine switch. No leaf-to-leaf links. This provides predictable, low-latency, and equal-cost paths between any two servers in the fabric.

Server → ToR / Leaf ↔ Spine ↔ Border Leaf → External / WAN (ECMP on the leaf-spine and spine-border links)

ECMP (Equal-Cost Multi-Path) — multiple equal-cost paths are load-balanced, increasing effective bandwidth and providing automatic failover
BGP as the fabric underlay — eBGP is increasingly used as the routing protocol within the datacenter fabric (RFC 7938)
VXLAN overlay — tunneling protocol that extends Layer 2 segments over Layer 3 underlay; enables VM/workload mobility across the fabric
Typical oversubscription — leaf ports (server-facing) to spine uplinks: 3:1 to 6:1 for standard compute; 1:1 for latency-sensitive workloads
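The oversubscription ratio of a leaf is its total server-facing bandwidth divided by its total uplink bandwidth. A minimal sketch, with port counts and speeds chosen purely as examples:

```python
def oversubscription_ratio(server_ports: int, server_port_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Server-facing bandwidth divided by uplink bandwidth for one leaf."""
    downlink_total = server_ports * server_port_gbps
    uplink_total = uplinks * uplink_gbps
    if uplink_total <= 0:
        raise ValueError("a leaf needs at least one uplink")
    return downlink_total / uplink_total


if __name__ == "__main__":
    # Example leaf: 48 x 25G server ports with 4 x 100G uplinks to the spines.
    ratio = oversubscription_ratio(48, 25, 4, 100)
    print(f"oversubscription {ratio:.0f}:1")
```

Tripling the uplinks on the example leaf to 12 × 100G brings the ratio to 1:1, the target quoted above for latency-sensitive workloads.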

Key Protocols & Technologies

Technology	Layer	Purpose
VLANs (802.1Q)	L2	Traffic segmentation within a switch/fabric
LACP / MLAG	L2	Link aggregation; dual-homing servers to two ToR switches
OSPF	L3	Interior gateway protocol; often used in smaller fabrics
BGP (eBGP)	L3	Preferred underlay routing in large leaf-spine fabrics
VXLAN (RFC 7348)	L3 overlay	Extend L2 domains over L3 routed fabric
BFD	L3	Sub-second failure detection for BGP/OSPF sessions
RDMA / RoCE	Transport	Low-latency networking for storage and HPC workloads

Design Principle

The network is the nervous system: design for predictable latency first, then redundancy, then throughput. A flapping link is worse than a link that is consistently down, because flapping causes intermittent application errors that are hard to diagnose, while a hard-down link triggers a clean failover.

Section 09

🔐 Physical & Logical Security

Physical Security Layers

Cage locks — electronic locks with audit logging; dual-factor (badge + PIN or biometric) for critical cages
Access logs — every cage entry/exit must be logged with timestamp and identity; retain for 90+ days minimum, often 12 months for compliance
Cameras — minimum coverage: cage entrance, all rack fronts. Motion-triggered recording; retain 30–90 days of footage
Escort policies — vendor technicians and colo staff must be escorted by your personnel; never allow unescorted access to your cage
Tamper-evident labels — on server panels and drive bays to detect unauthorized component access
Asset tagging — RFID or barcode on every device; reconcile against DCIM inventory quarterly

Logical Security

Dedicated management network (OOB) — BMC/iDRAC on a separate VLAN/subnet, isolated from production traffic, accessible only via jump host
Jump hosts (bastions) — all SSH/HTTPS access to servers routed through hardened bastion hosts with MFA and full session logging
Network segmentation — firewall between production, management, storage, and public network zones
Firmware / BIOS passwords — prevent unauthorized boot device changes or BIOS configuration
Secure boot — enabled on all servers to prevent boot-time malware
Drive encryption — full-disk encryption (AES-256) on all drives, especially in shared or multi-tenant environments

Zero Trust Principles

Apply zero-trust concepts even within the physical cage:

No implicit trust based on physical location (being inside the cage does not grant network access)
Authenticate every access — network, management plane, and physical
Least-privilege access — operators have access to only the racks they manage
Audit everything — access logs, CLI session recordings, change tickets

Compliance Note

Many frameworks (SOC 2, PCI-DSS, ISO 27001) require documented physical access controls, audit logs, and quarterly access reviews. Build these processes before your first audit, not during it.

Section 10

📊 Monitoring & Observability

Facility-Level Monitoring

PDU power draw — per-outlet kW, amps, power factor; alert on >80% circuit utilization
Temperature & humidity — rack-level sensors at U1 (inlet) and top-of-rack (exhaust); alert on inlet >27°C
Airflow — differential pressure sensors on cold aisle containment
UPS status — battery health, bypass status, runtime remaining
Generator test status — last tested, fuel level (facility-provided, but monitor it)

Hardware-Level Monitoring

CPU temperature — via IPMI/Redfish; alert if CPU package temp >70°C sustained
Memory ECC errors — correctable ECC errors are a warning; uncorrectable errors are critical and call for immediate replacement
Disk health — SMART attributes (reallocated sectors, pending sectors, uncorrectable errors); NVMe wear indicators
Fan RPM — alert on fan failure or RPM significantly below expected
PSU status — fault/OK per PSU via IPMI
NIC errors — CRC errors, input errors, drops; alert on sustained non-zero error rates
BMC connectivity — alert if OOB/IPMI unreachable (cannot manage the server remotely)

Observability Stack (Common)

Layer	Open Source	Commercial
Metrics collection	Prometheus + node_exporter, IPMI exporter, Telegraf	Datadog Agent
Visualization	Grafana	Datadog Dashboards, Splunk
Alerting	Alertmanager (with PagerDuty integration)	Datadog Alerts, OpsGenie
DCIM / Inventory	NetBox, OpenDCIM	Nlyte, Sunbird, Device42
Log aggregation	Loki + Promtail, OpenSearch	Datadog Logs, Splunk

Per-Rack Dashboard (Recommended)

Build a per-rack view showing: power draw (A + B feeds), inlet temp, outlet temp, live alerts, hardware fault count, and top-consuming servers. This lets an on-call engineer assess rack health at a glance during an incident without needing multiple tool windows.

Section 11

🤖 Automation & Source of Truth

The Automation Stack

Tool	Role	What It Manages
NetBox	Source of Truth / DCIM	Inventory, IP addresses, VLANs, rack layouts, cables
MAAS	Bare metal provisioning	PXE boot, OS install, cloud-init configuration
Ansible	Configuration management	OS hardening, package state, service config, idempotent enforcement
Terraform	Infrastructure as Code	Cloud resources, network device config (via providers)
Foreman/Satellite	Lifecycle management	Provisioning + configuration + patch management (alternative to MAAS+Ansible)

Ideal Provisioning Flow

Step	Stage	Action
1	📝 NetBox	Define rack, device, IP, VLAN in NetBox as the record of intent
2	⚡ Physical Install	Rack server, connect power (A+B), connect OOB and data NICs, label everything
3	🔄 MAAS / PXE	Server boots via PXE, MAAS commissions it, OS installed via cloud-init
4	🔧 Ansible	Runs playbooks to enforce desired state: packages, users, sysctl, monitoring agents
5	📊 Monitoring	Node auto-registered in Prometheus/Datadog via service discovery; alerts active

Automation Principle

If you do it more than twice manually, automate it. If it takes more than 30 minutes to provision a new server, your automation is incomplete. Target: zero-touch provisioning from power-on to production-ready.

NetBox as Source of Truth

NetBox should be the authoritative record for: every rack, every device, every IP address, every VLAN, and every cable. All automation tooling reads from NetBox as its source of truth, not from ad-hoc scripts or spreadsheets. Treat NetBox updates as part of the change process — no physical change without a NetBox update.
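Keeping NetBox authoritative implies regularly comparing it with what is physically in the cage. The sketch below reconciles two plain dictionaries mapping asset tag to location; how those dictionaries are produced (a NetBox export, a barcode scan) is outside the example, and every name and record in it is hypothetical.

```python
from typing import Dict, List


def reconcile(netbox: Dict[str, str], floor_audit: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare recorded and observed locations, keyed by asset tag."""
    recorded, seen = set(netbox), set(floor_audit)
    return {
        # In NetBox but not found during the walk-through.
        "missing_on_floor": sorted(recorded - seen),
        # Found in the cage but never recorded.
        "not_in_netbox": sorted(seen - recorded),
        # Present in both, but racked somewhere other than recorded.
        "location_mismatch": sorted(
            tag for tag in recorded & seen if netbox[tag] != floor_audit[tag]
        ),
    }


# Illustrative records only: asset tag -> "rack/U position".
NETBOX_EXPORT = {"SRV-001": "R01/U10", "SRV-002": "R01/U12", "SW-001": "R01/U42"}
FLOOR_AUDIT = {"SRV-001": "R01/U10", "SRV-002": "R02/U12", "SRV-003": "R02/U20"}

if __name__ == "__main__":
    for issue, tags in reconcile(NETBOX_EXPORT, FLOOR_AUDIT).items():
        print(f"{issue}: {tags}")
```

Each non-empty bucket maps to a different follow-up: a search, a new NetBox record, or a corrected rack position.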

Section 12

🔥 Failure Domains & Risk Management

Failure Mode Matrix

Layer	Failure Mode	Impact	Detection	Mitigation
Facility Power	Utility grid failure	Total DC outage	UPS on-battery alert	Generator, UPS, 2N feeds
Rack Power	PDU circuit trip	Single rack partial/full loss	PDU alert, server offline	Dual PDU, A+B feeds, 80% rule
Cooling	CRAC unit failure	Rising inlet temps, thermal throttle	Temperature sensors, alerts	N+1 CRAC units, hot-aisle containment
Cooling	Airflow blockage	Hot spots, localized throttle	Rack inlet sensor spike	Blanking panels, cable management
Network	ToR switch failure	Rack network loss	BGP/OSPF neighbor drop	MLAG dual-homing to two ToR switches
Network	Spine switch failure	Fabric partial capacity	ECMP path count drop	N+2 spine switches, fast failover
Hardware	Disk failure	Data degradation (RAID), potential data loss	SMART alerts, RAID events	RAID 10/6, prompt replacement, spare pool
Hardware	PSU failure	Potential server shutdown	IPMI PSU fault alert	Dual PSU on A+B feeds
Human	Wrong cable unplugged	Unexpected outage	Network/server alert	Clear labeling, change management, lockout tags
Human	Unauthorized access	Security breach, data theft	Access log anomaly, CCTV	Dual-factor, escort policy, audit logs

Failure Domain Design

A failure domain is the blast radius of a single failure. The goal is to minimize the blast radius by isolating components at appropriate granularity:

Per-server — dual PSU, redundant NICs
Per-rack — dual PDUs on separate feeds; dual ToR switches via MLAG
Per-row — in-row cooling, separate power circuit
Per-cage — dedicated cage power feeds, separate cooling zone
Per-datacenter — geographic redundancy, site-to-site replication

Common Pitfall

A/B power redundancy only works if the A and B feeds are truly independent — separate utility feeds, separate UPS systems, separate PDU breakers. If both feeds share a single UPS or a single upstream breaker, you have the appearance of redundancy without the reality.

Section 13

🧾 Operational Checklists

Daily Checks

Review active alerts (power, temperature, network, hardware) — triage any P2+ alerts
Verify UPS status — all units on utility, no battery-only conditions
Check temperature dashboard — no rack inlets above 27°C
Verify all critical servers are reachable (ping / OOB connectivity)
Review access logs for unexpected cage entry events

Weekly Checks

Review power capacity — no circuit above 70% average utilization
Review network bandwidth utilization — no sustained link above 70%
Check disk health dashboard — any drives with elevated SMART errors
Verify cable integrity — no loose or unlabeled cables observed during any site visit
Review OOB (BMC/IPMI) reachability — all nodes responding

Monthly Checks

Audit physical inventory vs NetBox — walk the floor, reconcile discrepancies
Test failover paths — intentionally fail one PDU feed and verify A/B redundancy
Review and rotate credentials — PDU admin passwords, BMC accounts
Check spare parts inventory — drives, PSUs, cables, transceivers
Review access list — remove departed personnel

Quarterly Checks

Disaster recovery simulation — full tabletop or live failover test
Firmware updates — server BIOS, BMC/iDRAC, NIC firmware, switch OS
Capacity planning review — project 6-month power, space, and network growth
Review and update runbooks — ensure documentation reflects current architecture
Compliance access review — formal review of all physical and logical access rights

Section 14

🧠 Advanced Concepts

Capacity Planning

Capacity planning is the practice of ensuring your cage has sufficient headroom — in power, cooling, space, and network — to accommodate planned growth without emergency procurement. Key metrics:

Power budget per rack — track committed vs available kW per circuit; alert when committed reaches 70% of rated capacity
Network oversubscription ratios — server-facing port bandwidth ÷ leaf uplink bandwidth. Typical: 4:1 (standard compute), 1:1 or 2:1 (latency-sensitive or storage)
Space (U) utilization — per-rack and per-cage tracking; maintain 30% headroom for emergency rack swaps
Cooling envelope — calculate actual heat load vs CRAC/CRAH rated capacity; target <70% utilization of cooling capacity
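For the cooling envelope, a rough first pass assumes that essentially all electrical power drawn by IT equipment ends up as heat, so the heat load in kW equals the IT load in kW (1 kW is about 3,412 BTU/hr). The sketch below applies that assumption; the function names and sample figures are illustrative.

```python
BTU_PER_HR_PER_KW = 3412.14  # 1 kW of heat expressed in BTU per hour

COOLING_TARGET_PCT = 70.0    # utilization target from the metrics above


def heat_load_btu_per_hr(it_load_kw: float) -> float:
    """Heat output of an IT load, assuming all input power becomes heat."""
    return it_load_kw * BTU_PER_HR_PER_KW


def cooling_utilization_pct(it_load_kw: float, cooling_capacity_kw: float) -> float:
    """Share of rated cooling capacity consumed by the IT heat load."""
    if cooling_capacity_kw <= 0:
        raise ValueError("cooling capacity must be positive")
    return it_load_kw / cooling_capacity_kw * 100


if __name__ == "__main__":
    load_kw, capacity_kw = 84.0, 100.0  # hypothetical cage load and CRAH rating
    used = cooling_utilization_pct(load_kw, capacity_kw)
    verdict = "within target" if used < COOLING_TARGET_PCT else "over target"
    print(f"{heat_load_btu_per_hr(load_kw):,.0f} BTU/hr, {used:.0f}% of capacity ({verdict})")
```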

High-Density / AI Racks

GPU and AI accelerator racks (NVIDIA DGX, AMD Instinct clusters) present fundamentally different engineering challenges than standard compute:

Power Density

20–100 kW

Per rack; vs 5–15 kW standard

Cooling Method

DLC / Rear-door

Air cooling insufficient above ~25 kW/rack

Network

200G–400G HDR/NDR

InfiniBand or 400GbE for GPU fabric

Floor Load

1200+ kg/rack

Verify structural floor capacity

Cost Awareness

Cost Category	Type	Typical Range	Optimization Lever
Colo space (per cabinet/mo)	OPEX	$800–$3000/cabinet	Density optimization, right-sizing
Power (per kW/mo committed)	OPEX	$50–$150/kW	Improve PUE, decommission idle servers
Cross-connect (per port/mo)	OPEX	$300–$1000/port	Consolidate carriers, use transit IX
Server hardware	CAPEX	$3k–$50k/server	Bulk procurement, lease vs buy analysis
Network hardware	CAPEX	$5k–$200k/switch	Whitebox alternatives, open networking

Reliability Engineering

MTBF (Mean Time Between Failures) — average time between hardware failures. Higher MTBF = more reliable hardware. Use vendor-published MTBF as a planning input, not a guarantee.
MTTR (Mean Time To Repair/Restore) — average time to restore service after failure. This is the metric you control operationally: good runbooks, spare parts on-site, and practiced procedures reduce MTTR.
Availability = MTBF ÷ (MTBF + MTTR). A device with 10,000h MTBF and 2h MTTR has 99.98% availability.
SLOs for infrastructure — define and track SLOs for power availability, network uptime, and provisioning time. Without SLOs, you cannot tell if your infrastructure is improving.
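The availability formula above can be checked with a few lines. The second call in the usage example halves MTTR to show how repair time moves the result; the numbers are the ones from the text plus one assumed variation.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    for mttr in (2.0, 1.0):
        pct = availability(10_000, mttr) * 100
        print(f"MTBF 10,000h, MTTR {mttr:.0f}h -> {pct:.3f}% available")
```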

Reliability Insight

MTTR matters more than MTBF at scale. In a fleet of 1,000 servers, you will have hardware failures every week regardless of MTBF. The question is not if something fails — it is how fast you detect and recover. Invest in detection and automation proportionally to your fleet size.
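The fleet-size argument can be made concrete with a back-of-the-envelope estimate. This sketch assumes independent failures at a constant rate, and the 100,000-hour MTBF is an illustrative figure rather than a vendor number.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def expected_failures_per_week(fleet_size: int, mtbf_hours: float) -> float:
    """Expected failures per week across a fleet at a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return fleet_size * HOURS_PER_WEEK / mtbf_hours


if __name__ == "__main__":
    for fleet in (100, 1_000, 10_000):
        rate = expected_failures_per_week(fleet, 100_000)
        print(f"{fleet:>6} servers -> about {rate:.2f} failures per week")
```

Under these assumptions the failure count scales linearly with fleet size, which is why detection and repair speed dominate the operational picture as the fleet grows.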
