🧠Mental Model: What You're Managing
A datacenter cage is a locked, dedicated enclosure within a larger multi-tenant colocation facility. Colocation (colo) providers sell space, power, and connectivity — you own the hardware. The cage is your controlled micro-environment inside their building.
Think of it as a system with five distinct but interdependent layers. Failures in lower layers cascade upward; over-provisioning in one layer cannot compensate for a bottleneck in another.
Core Mandate
Ensure availability, safety, performance, and traceability across all five layers. Your job is to ensure no bottleneck, no failure, no silent degradation.
SRE Mapping
If you think in pipelines, the datacenter maps cleanly:
Each "pipeline" has an input, a capacity, a failure mode, and a monitoring signal. Treat them accordingly.
🧱Cage Design & Layout
A cage is typically a welded or bolted mesh enclosure (floor-to-ceiling or raised) within a shared colo floor. Solid-wall cages offer more security; mesh cages allow easier visual inspection and airflow monitoring. Most enterprise deployments use mesh with solid-panel overlays for higher-security zones.
Core Design Concepts
- Cold aisle / hot aisle containment — servers face the cold aisle (cool intake air), exhaust into the hot aisle. Containment systems prevent mixing of cold and hot air streams, dramatically improving cooling efficiency (PUE impact: ~0.1–0.2).
- Density planning (kW per rack) — standard enterprise racks run 5–10 kW; GPU/AI racks can reach 20–80 kW. Know your density before placing hardware. Overcrowding causes thermal runaway.
- Growth planning — maintain 30–50% free rack space and 20–30% free power headroom at all times. Running at 90%+ capacity removes your buffer for planned maintenance or emergency rerouting.
Physical Components
Layout Principles
- Centralize network racks to minimize cable runs and latency
- Separate compute, storage, and network into distinct rack zones
- Isolate high-density GPU/AI racks — they require dedicated cooling circuits
- Keep critical infrastructure (PDUs, OOB switches) accessible without disturbing production racks
Layout Rule
Minimize cable length, maximize airflow, and isolate failure domains. Every layout decision is a trade-off between these three.
Uptime Institute Tier Classification
Colo facilities are rated Tier I–IV. Your cage's reliability ceiling is determined by the facility tier.
| Tier | Redundancy | Uptime SLA | Annual Downtime |
|---|---|---|---|
| Tier I | No redundancy | 99.671% | ≤28.8 hrs |
| Tier II | Redundant components | 99.741% | ≤22.7 hrs |
| Tier III | N+1, concurrently maintainable | 99.982% | ≤1.6 hrs |
| Tier IV | 2N, fault tolerant | 99.995% | ≤26.3 min |
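The downtime column follows directly from the uptime percentage. A minimal sketch of that arithmetic, assuming a non-leap year of 8,760 hours; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 h; leap years ignored for simplicity


def annual_downtime_hours(availability_pct: float) -> float:
    """Maximum yearly downtime implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


# Uptime figures taken from the tier table above.
for tier, pct in [("Tier I", 99.671), ("Tier II", 99.741),
                  ("Tier III", 99.982), ("Tier IV", 99.995)]:
    print(f"{tier}: {annual_downtime_hours(pct):.2f} h/year")
```

The same function is useful for translating any SLA percentage a provider quotes into hours you must be able to absorb.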
🗄️Rack Engineering
Standard Rack Specifications
Vertical Rack Layout (Top → Bottom)
This ordering optimizes airflow, cable management, and center of gravity:
Common Mistake
Do not fill every U-space with equipment, and do not leave the unused ones open. Fit a blanking panel over every unused U: air leakage through empty slots can reduce cooling efficiency by 10–30%.
U-Space Planning
Track U-space in your DCIM or CMDB (e.g., NetBox). Each device consumes a specific U count. Common examples:
| Device Type | Typical U | Example |
|---|---|---|
| 1U server | 1U | Dell PowerEdge R650, Supermicro 1029U |
| 2U server | 2U | Dell R750, HPE ProLiant DL380 Gen10 |
| GPU server | 4U–6U | Supermicro SYS-420GP (4U), NVIDIA DGX A100 (6U) |
| ToR switch | 1U | Arista 7050X3, Cisco Nexus 93180YC |
| Patch panel (24-port) | 1U | Leviton, Panduit |
| KVM switch | 1U | Raritan Dominion |
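U-space tracking reduces to summing device heights against the rack height. The sketch below assumes a 42U rack and a hypothetical device list; in practice these values would come from your DCIM rather than being hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    u_height: int


def free_u_fraction(rack_height_u: int, devices: list[Device]) -> float:
    """Fraction of the rack that is still unallocated."""
    used = sum(d.u_height for d in devices)
    if used > rack_height_u:
        raise ValueError(f"rack over-allocated: {used}U used of {rack_height_u}U")
    return (rack_height_u - used) / rack_height_u


# Hypothetical rack contents, for illustration only.
rack = [
    Device("tor-switch", 1),
    Device("patch-panel", 1),
    Device("server-2u-a", 2),
    Device("server-2u-b", 2),
    Device("gpu-server-6u", 6),
]
print(f"free space: {free_u_fraction(42, rack):.0%}")
```

Comparing the result against the 30–50% free-space target from the growth planning guidance gives a simple per-rack capacity check.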
⚡Power Systems
Power is the most critical physical resource in a cage. A power failure cascades instantly to every system; no redundancy elsewhere compensates for lost power. Understand every step in the power chain.
Power Chain
ATS = Automatic Transfer Switch; STS = Static Transfer Switch; UPS = Uninterruptible Power Supply; PDU = Power Distribution Unit
Redundancy Model
Enterprise standard is A/B power feeds — two completely independent paths from separate utility feeds, separate UPS systems, separate PDUs, to dual PSUs in each server.
Hard Rule
Never exceed 80% of any circuit's rated capacity (NEC 80% rule for continuous loads). A 20A circuit = 16A usable. A 30A circuit = 24A usable. Sustained loads above 80% risk breaker trips and thermal damage.
Power Concepts
kW vs kVA
kVA is apparent power (what the PDU is rated for); kW is real power (what the server actually consumes). The ratio is the power factor (PF). Modern server PSUs typically have PF ≥ 0.95, so kW ≈ kVA × 0.95. Always plan budgets in kW (actual consumption), but size circuits in kVA (what the PDU/breaker must handle).
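The 80% rule and the kW/kVA relationship combine into a quick circuit budget. This is a minimal sketch, assuming a single-phase 208 V circuit (the voltage is an illustrative assumption, not a value from this guide) and the power factor of 0.95 quoted above; the function names are hypothetical.

```python
def usable_amps(breaker_amps: float, derate: float = 0.80) -> float:
    """Continuous-load limit for a breaker under the 80% rule."""
    return breaker_amps * derate


def apparent_power_kva(volts: float, amps: float) -> float:
    """Single-phase apparent power in kVA."""
    return volts * amps / 1000.0


def real_power_kw(kva: float, power_factor: float = 0.95) -> float:
    """Real power available to IT load for a given apparent power."""
    return kva * power_factor


amps = usable_amps(30)               # 30 A breaker
kva = apparent_power_kva(208, amps)  # assumed 208 V single-phase feed
kw = real_power_kw(kva)
print(f"{amps:.1f} A usable, {kva:.2f} kVA, {kw:.2f} kW")
```

Under these assumptions a 30 A circuit supports roughly 4.7 kW of continuous IT load, which is the number to use when budgeting servers onto that circuit.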
Phase Balancing
Three-phase power is standard in datacenters. Distribute load evenly across phases (L1, L2, L3) to avoid overloading a single phase. Target imbalance < 10% between phases. Measure at the PDU breaker level.
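One way to quantify imbalance is the largest deviation from the mean phase current, expressed as a percentage of the mean. That definition is an assumption here (other definitions exist), and the readings are hypothetical.

```python
def phase_imbalance_pct(l1_amps: float, l2_amps: float, l3_amps: float) -> float:
    """Largest deviation from the mean phase current, as a percent of the mean."""
    loads = (l1_amps, l2_amps, l3_amps)
    mean = sum(loads) / 3
    if mean == 0:
        return 0.0  # an unloaded PDU is trivially balanced
    return max(abs(x - mean) for x in loads) / mean * 100


# Hypothetical per-phase readings from a PDU, in amps.
for reading in [(12.0, 12.0, 11.0), (14.0, 12.0, 10.0)]:
    pct = phase_imbalance_pct(*reading)
    status = "OK" if pct < 10 else "REBALANCE"
    print(f"{reading}: {pct:.1f}% {status}")
```

The second reading exceeds the 10% target even though no single phase looks alarming on its own, which is why the check is worth automating.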
Power Measurement Tools
- Smart PDUs (e.g., Raritan PX3, APC AP8900 series) — per-outlet monitoring, remote switching
- DCIM platforms (e.g., Nlyte, Sunbird) — aggregate power dashboards
- IPMI/BMC — per-server power consumption via Redfish or IPMI 2.0
PUE (Power Usage Effectiveness)
PUE = Total Facility Power ÷ IT Equipment Power. A PUE of 1.0 is theoretical perfection; modern facilities target 1.2–1.4. A PUE of 2.0 means as much energy is wasted on overhead (cooling, lighting) as powers IT gear — unacceptable by today's standards.
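The PUE formula is simple enough to check by hand, but a small helper makes the overhead explicit. A minimal sketch with hypothetical facility figures:

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power Usage Effectiveness: total facility power over IT power."""
    if it_kw <= 0:
        raise ValueError("IT power must be positive")
    return total_facility_kw / it_kw


def overhead_kw(total_facility_kw: float, it_kw: float) -> float:
    """Power spent on cooling, lighting, and distribution losses."""
    return total_facility_kw - it_kw


# Hypothetical facility: 1,000 kW of IT load in both cases.
for total in (1300.0, 2000.0):
    print(f"total={total:.0f} kW  PUE={pue(total, 1000.0):.2f}  "
          f"overhead={overhead_kw(total, 1000.0):.0f} kW")
```

At a PUE of 2.0 the overhead equals the IT load itself, which is the case the text calls unacceptable.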
🌬️Cooling & Airflow
Thermal management is the second critical physical resource. Servers tolerate brief power interruption via UPS; they tolerate almost no thermal excursion — CPU throttling begins at ~70°C and emergency shutdown typically triggers at 85–95°C.
Cooling Models
| Model | How It Works | Best For | Limitation |
|---|---|---|---|
| Cold Aisle Containment (CAC) | Enclose the cold aisle; servers draw cool air from inside the containment | Standard compute, 5–15 kW/rack | Hot exhaust enters open data hall |
| Hot Aisle Containment (HAC) | Enclose the hot aisle; hot air is captured and returned directly to CRACs | Higher efficiency than CAC, same density | Hot aisle is inaccessible during operation |
| In-Row Cooling | Cooling units placed between racks; cool air delivered at row level | High-density rows, GPU clusters | Higher CAPEX per kW cooled |
| Rear-Door Heat Exchanger | Chilled water coil in rack rear door absorbs server exhaust directly | Very high density (15–40 kW/rack) | Requires chilled water plumbing per rack |
| Direct Liquid Cooling (DLC) | Cold plates on CPUs/GPUs with liquid coolant; near-zero air cooling | AI/ML racks, 40–100+ kW/rack | Complex plumbing, leak risk, higher cost |
ASHRAE Thermal Guidelines
ASHRAE TC 9.9 defines environmental envelopes for IT equipment:
Best Practices
- Blanking panels — fill every unused rack U. Without them, cold air short-circuits from cold aisle through the rack to hot aisle without cooling anything
- Cable cutout seals — use brush strips or grommets on raised floor cutouts to prevent hot-cold air mixing under-floor
- Airflow monitoring — deploy temperature sensors at rack inlet (U1–U3) and outlet (top of rack). Alert on inlet >27°C or delta T >15°C
- CRAC vs CRAH — CRAC (Computer Room Air Conditioner) uses DX refrigerant; CRAH (Computer Room Air Handler) uses chilled water. CRAH is more efficient at scale, requires chiller plant
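The airflow monitoring thresholds above translate into a small alert check. This is a sketch under the stated limits (inlet above 27°C, delta T above 15°C); the function name and message wording are illustrative.

```python
INLET_MAX_C = 27.0    # inlet alert threshold from the guidance above
DELTA_T_MAX_C = 15.0  # maximum acceptable outlet-minus-inlet difference


def airflow_alerts(inlet_c: float, outlet_c: float) -> list[str]:
    """Return alert messages for one rack's inlet and outlet readings."""
    alerts = []
    if inlet_c > INLET_MAX_C:
        alerts.append(f"inlet {inlet_c:.1f}C exceeds {INLET_MAX_C:.0f}C")
    delta_t = outlet_c - inlet_c
    if delta_t > DELTA_T_MAX_C:
        alerts.append(f"delta T {delta_t:.1f}C exceeds {DELTA_T_MAX_C:.0f}C")
    return alerts


print(airflow_alerts(24.0, 35.0))  # healthy rack, no alerts
print(airflow_alerts(28.0, 45.0))  # both thresholds exceeded
```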
Efficiency Tip
Every 1°C rise in server inlet temperature allows CRAC setpoint to rise ~1°C, reducing cooling energy ~2–4%. Raising setpoint from 18°C to 27°C can cut cooling energy by 15–40%.
🔌Cabling & Connectivity
Network Cable Types
| Type | Max Distance | Speed | Use Case |
|---|---|---|---|
| Cat6 | 55m @ 10G | Up to 10 GbE | Short server-to-ToR runs |
| Cat6A | 100m @ 10G | Up to 10 GbE | Standard structured cabling |
| DAC (Direct Attach Copper) | 1–7m | 25G / 40G / 100G | Server-to-ToR, very short high-speed |
| AOC (Active Optical Cable) | Up to 100m | 25G / 100G / 400G | Cross-cage, spine interconnects |
| OM3/OM4 Fiber (Multi-mode) | 300m / 400m @ 10G | Up to 100G | Within same facility floor |
| OS2 Fiber (Single-mode) | 10+ km | Up to 400G+ | Cross-facility, long-haul backbone |
Power Cable Standards
- C13/C14 — standard IEC 60320, up to 10A/15A; most 1U–2U servers
- C19/C20 — heavy-duty IEC 60320, up to 16A/20A; high-power servers, GPUs, storage
- NEMA L6-20 / L6-30 — locking connectors, used for PDU feeds in North American facilities
- IEC 60309 — industrial connectors (16A/32A), common in European colo facilities
Cable Management Principles
- Use horizontal cable managers (1U arm) at every patch panel and switch level
- Vertical cable trays on both sides of the rack for power and data segregation
- Label both ends of every cable with source and destination, using a consistent scheme such as RACK-RU-PORT
- Velcro ties, not zip ties: zip ties crush cable jackets and make changes destructive
- Color coding example: Blue = data, Red = management/OOB, Yellow = SAN/storage, Black = power
- Separate power and data cables in different trays to minimize EMI
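A RACK-RU-PORT label is easiest to keep consistent when it is generated and parsed by code rather than typed by hand. The sketch below assumes one possible concrete format (`A07-U24-P03`, with no hyphen inside the rack name); the field layout is an illustration, not a standard.

```python
from typing import NamedTuple


class CableEnd(NamedTuple):
    rack: str
    rack_unit: int
    port: str


def format_label(end: CableEnd) -> str:
    """Render one cable end as RACK-RU-PORT, e.g. A07-U24-P03."""
    return f"{end.rack}-U{end.rack_unit:02d}-{end.port}"


def parse_label(label: str) -> CableEnd:
    """Parse a label produced by format_label; raise ValueError if malformed."""
    parts = label.split("-", 2)
    if len(parts) != 3:
        raise ValueError(f"expected RACK-RU-PORT, got {label!r}")
    rack, ru, port = parts
    if not ru.startswith("U") or not ru[1:].isdigit():
        raise ValueError(f"bad rack unit field in {label!r}")
    return CableEnd(rack, int(ru[1:]), port)


source = CableEnd("A07", 24, "P03")
print(format_label(source))
print(parse_label("B02-U01-P48"))
```

Printing both ends from the same record guarantees the label on the cable matches what the inventory says.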
Operations Rule
If you cannot trace a cable's path and purpose in under 10 seconds, your cable management is failing. Fix it before an incident forces you to do it under pressure.
Fiber Optic Handling
- Never exceed the minimum bend radius (typically 10× cable diameter for fiber)
- Use LC connectors for most server/switch connections; MPO/MTP for high-density breakout
- Clean fiber end-faces before every insertion — a single dirty connector can degrade link performance or cause intermittent errors
- Document fiber runs in your DCIM: port-to-port mapping, connector type, length, dB loss measurement
🖥️Server Components
Key Subsystems
| Subsystem | Key Spec | Failure Signal | Monitoring Method |
|---|---|---|---|
| CPU | TDP, core count, frequency | Thermal throttle, NMI, MCE | IPMI sensors, OS metrics |
| RAM (DRAM) | ECC DDR5, speed, capacity | Correctable/uncorrectable ECC errors | edac-util, IPMI, vendor BMC |
| NVMe / SSD | DWPD, capacity, latency | SMART errors, reallocated sectors | smartctl, nvme-cli |
| NIC | Speed (25G/100G), port count | CRC errors, packet drops, link flap | ethtool, SNMP, DCIM |
| PSU | Wattage, efficiency (80 Plus) | PSU fault LED, IPMI alert | IPMI, Redfish, smart PDU |
| BMC/IPMI | Out-of-band management chip | Inaccessible console | Regular connectivity checks |
| Fans | RPM, airflow CFM | Fan fault LED, high inlet temp | IPMI sensor polling |
Hot-Swap Components
These can be replaced without powering down the server (vendor-dependent — always verify):
- PSU (when redundant)
- Drives in a RAID array (with proper RAID rebuild procedure)
- Fans (in most enterprise-grade servers)
- NIC in PCIe hot-swap bays (rare; OCP 3.0 mezzanine cards on some platforms)
IPMI / BMC / Redfish
Every enterprise server has an out-of-band management interface separate from the main OS network. Vendor naming varies: iDRAC (Dell), iLO (HPE), XClarity Controller (Lenovo), and plain IPMI/BMC (Supermicro). The modern standard API is Redfish, a RESTful HTTPS interface designed to replace legacy IPMI 2.0 LAN commands.
- Always put BMC/iDRAC on a dedicated management VLAN (OOB network)
- Set IPMI access to management network only — never expose to the internet
- Use Redfish for programmatic provisioning and sensor polling
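Sensor polling over Redfish returns JSON, so the processing step is ordinary dictionary handling. The sketch below works on a hand-written sample shaped like the Redfish `Thermal` resource (a `Temperatures` array with `Name` and `ReadingCelsius`); the sensor names and readings are made up, and the HTTPS fetch from the BMC is deliberately omitted. Newer firmware may expose a different thermal resource, so treat the payload shape as an assumption to verify against your BMC.

```python
# Sample payload shaped like a Redfish Thermal resource.
# Names and readings are hypothetical.
SAMPLE_THERMAL = {
    "Temperatures": [
        {"Name": "Inlet Temp", "ReadingCelsius": 24},
        {"Name": "CPU1 Temp", "ReadingCelsius": 72},
        {"Name": "CPU2 Temp", "ReadingCelsius": None},  # sensor absent
    ]
}


def hot_sensors(thermal: dict, threshold_c: float) -> list[tuple[str, float]]:
    """Return (name, reading) for every sensor above the threshold."""
    hot = []
    for sensor in thermal.get("Temperatures", []):
        reading = sensor.get("ReadingCelsius")
        if reading is not None and reading > threshold_c:
            hot.append((sensor.get("Name", "unknown"), reading))
    return hot


print(hot_sensors(SAMPLE_THERMAL, threshold_c=70))
```

Skipping `None` readings matters in practice, because unpopulated sockets and absent sensors are commonly reported that way.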
OOB Discipline
An unreachable BMC is a server you cannot recover without a physical site visit. Test OOB connectivity for every server at deploy time and include it in your monitoring. Down BMC = P2 incident in your runbook.
80 Plus Efficiency Standards
| Certification | Min Efficiency @ 20% | Min Efficiency @ 50% | Min Efficiency @ 100% |
|---|---|---|---|
| 80 Plus Bronze | 82% | 85% | 82% |
| 80 Plus Gold | 87% | 90% | 87% |
| 80 Plus Platinum | 90% | 92% | 89% |
| 80 Plus Titanium | 92% | 94% | 90% |
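Efficiency determines how much power the PSU pulls from the PDU for a given server load. A minimal sketch using the 50%-load column of the table above and a hypothetical 500 W DC load:

```python
def wall_power_w(dc_load_w: float, efficiency: float) -> float:
    """AC power drawn from the PDU to deliver dc_load_w to the server."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return dc_load_w / efficiency


# Efficiency at 50% load, from the table above.
EFFICIENCY_AT_50 = {"Bronze": 0.85, "Gold": 0.90, "Platinum": 0.92, "Titanium": 0.94}

for cert, eff in EFFICIENCY_AT_50.items():
    print(f"{cert}: {wall_power_w(500, eff):.1f} W at the wall")
```

The per-server difference between tiers is a few tens of watts, which adds up across a cage and counts against both the power budget and the cooling load.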
🌐Networking Architecture
Three-Tier vs Leaf-Spine
Traditional three-tier (access → aggregation → core) was designed for client-server traffic patterns with most traffic going north-south (user → server). It is increasingly inadequate for modern east-west traffic-heavy workloads (server-to-server, distributed systems, microservices).
Leaf-Spine Architecture (Modern Standard)
Every leaf switch connects to every spine switch. No leaf-to-leaf links. This provides predictable, low-latency, and equal-cost paths between any two servers in the fabric.
- ECMP (Equal-Cost Multi-Path) — multiple equal-cost paths are load-balanced, increasing effective bandwidth and providing automatic failover
- BGP as the fabric underlay — eBGP is increasingly used as the routing protocol within the datacenter fabric (RFC 7938)
- VXLAN overlay — tunneling protocol that extends Layer 2 segments over Layer 3 underlay; enables VM/workload mobility across the fabric
- Typical oversubscription — leaf ports (server-facing) to spine uplinks: 3:1 to 6:1 for standard compute; 1:1 for latency-sensitive workloads
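The oversubscription ratio is server-facing bandwidth divided by uplink bandwidth on a leaf. A short sketch with hypothetical port counts:

```python
def oversubscription(server_ports: int, server_port_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Downstream (server-facing) bandwidth divided by upstream (spine) bandwidth."""
    return (server_ports * server_port_gbps) / (uplinks * uplink_gbps)


# Hypothetical leaf: 48 x 25G server ports.
print(oversubscription(48, 25, 4, 100))   # 4 x 100G uplinks
print(oversubscription(48, 25, 12, 100))  # 12 x 100G uplinks
```

With four uplinks the leaf lands at 3:1, the low end of the standard compute range; reaching 1:1 on the same leaf takes twelve uplinks, which is why non-blocking fabrics are reserved for latency-sensitive workloads.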
Key Protocols & Technologies
| Technology | Layer | Purpose |
|---|---|---|
| VLANs (802.1Q) | L2 | Traffic segmentation within a switch/fabric |
| LACP / MLAG | L2 | Link aggregation; dual-homing servers to two ToR switches |
| OSPF | L3 | Interior gateway protocol; often used in smaller fabrics |
| BGP (eBGP) | L3 | Preferred underlay routing in large leaf-spine fabrics |
| VXLAN (RFC 7348) | L3 overlay | Extend L2 domains over L3 routed fabric |
| BFD | L3 | Sub-second failure detection for BGP/OSPF sessions |
| RDMA / RoCE | Transport | Low-latency networking for storage and HPC workloads |
Design Principle
The network is the nervous system: design for predictable latency first, then redundancy, then throughput. A flapping link that comes and goes is worse than a link that is consistently down, because the former causes intermittent application errors that are hard to diagnose.
🔐Physical & Logical Security
Physical Security Layers
- Cage locks — electronic locks with audit logging; dual-factor (badge + PIN or biometric) for critical cages
- Access logs — every cage entry/exit must be logged with timestamp and identity; retain for 90+ days minimum, often 12 months for compliance
- Cameras — minimum coverage: cage entrance, all rack fronts. Motion-triggered recording; retain 30–90 days of footage
- Escort policies — vendor technicians and colo staff must be escorted by your personnel; never allow unescorted access to your cage
- Tamper-evident labels — on server panels and drive bays to detect unauthorized component access
- Asset tagging — RFID or barcode on every device; reconcile against DCIM inventory quarterly
Logical Security
- Dedicated management network (OOB) — BMC/iDRAC on a separate VLAN/subnet, isolated from production traffic, accessible only via jump host
- Jump hosts (bastions) — all SSH/HTTPS access to servers routed through hardened bastion hosts with MFA and full session logging
- Network segmentation — firewall between production, management, storage, and public network zones
- Firmware / BIOS passwords — prevent unauthorized boot device changes or BIOS configuration
- Secure boot — enabled on all servers to prevent boot-time malware
- Drive encryption — full-disk encryption (AES-256) on all drives, especially in shared or multi-tenant environments
Zero Trust Principles
Apply zero-trust concepts even within the physical cage:
- No implicit trust based on physical location (being inside the cage does not grant network access)
- Authenticate every access — network, management plane, and physical
- Least-privilege access — operators have access to only the racks they manage
- Audit everything — access logs, CLI session recordings, change tickets
Compliance Note
Many frameworks (SOC 2, PCI-DSS, ISO 27001) require documented physical access controls, audit logs, and quarterly access reviews. Build these processes before your first audit, not during it.
📊Monitoring & Observability
Facility-Level Monitoring
- PDU power draw — per-outlet kW, amps, power factor; alert on >80% circuit utilization
- Temperature & humidity — rack-level sensors at U1 (inlet) and top-of-rack (exhaust); alert on inlet >27°C
- Airflow — differential pressure sensors on cold aisle containment
- UPS status — battery health, bypass status, runtime remaining
- Generator test status: last tested, fuel level (facility-provided, but monitor it)
Hardware-Level Monitoring
- CPU temperature — via IPMI/Redfish; alert if CPU package temp >70°C sustained
- Memory ECC errors — correctable ECC errors are warning; uncorrectable = critical, immediate replacement
- Disk health — SMART attributes (reallocated sectors, pending sectors, uncorrectable errors); NVMe wear indicators
- Fan RPM — alert on fan failure or RPM significantly below expected
- PSU status — fault/OK per PSU via IPMI
- NIC errors — CRC errors, input errors, drops; alert on sustained non-zero error rates
- BMC connectivity — alert if OOB/IPMI unreachable (cannot manage the server remotely)
Observability Stack (Common)
| Layer | Open Source | Commercial |
|---|---|---|
| Metrics collection | Prometheus + node_exporter, IPMI exporter, Telegraf | Datadog Agent |
| Visualization | Grafana | Datadog Dashboards, Splunk |
| Alerting | Alertmanager | PagerDuty, Datadog Alerts, OpsGenie |
| DCIM / Inventory | NetBox, OpenDCIM | Nlyte, Sunbird, Device42 |
| Log aggregation | Loki + Promtail, OpenSearch | Datadog Logs, Splunk |
Per-Rack Dashboard (Recommended)
Build a per-rack view showing: power draw (A + B feeds), inlet temp, outlet temp, live alerts, hardware fault count, and top-consuming servers. This lets an on-call engineer assess rack health at a glance during an incident without needing multiple tool windows.
🤖Automation & Source of Truth
The Automation Stack
| Tool | Role | What It Manages |
|---|---|---|
| NetBox | Source of Truth / DCIM | Inventory, IP addresses, VLANs, rack layouts, cables |
| MAAS | Bare metal provisioning | PXE boot, OS install, cloud-init configuration |
| Ansible | Configuration management | OS hardening, package state, service config, idempotent enforcement |
| Terraform | Infrastructure as Code | Cloud resources, network device config (via providers) |
| Foreman/Satellite | Lifecycle management | Provisioning + configuration + patch management (alternative to MAAS+Ansible) |
Ideal Provisioning Flow
Automation Principle
If you do it more than twice manually, automate it. If it takes more than 30 minutes to provision a new server, your automation is incomplete. Target: zero-touch provisioning from power-on to production-ready.
NetBox as Source of Truth
NetBox should be the authoritative record for: every rack, every device, every IP address, every VLAN, and every cable. All automation tooling reads from NetBox as its source of truth, not from ad-hoc scripts or spreadsheets. Treat NetBox updates as part of the change process — no physical change without a NetBox update.
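Enforcing this rule comes down to comparing what the source of truth records against what is physically present. The sketch below uses plain sets of hypothetical asset tags; in a real setup the recorded set would be exported from NetBox and the observed set from a floor walk or barcode scan, neither of which is shown here.

```python
def reconcile(recorded: set[str], observed: set[str]) -> dict[str, set[str]]:
    """Compare source-of-truth inventory against what is physically present."""
    return {
        "missing_on_floor": recorded - observed,  # in the record, not found
        "undocumented": observed - recorded,      # found, not in the record
    }


# Hypothetical asset tags, for illustration only.
recorded = {"SRV-0001", "SRV-0002", "SW-0101"}
observed = {"SRV-0001", "SW-0101", "SRV-0099"}

for category, tags in reconcile(recorded, observed).items():
    print(category, sorted(tags))
```

Any non-empty result means a physical change happened without a matching record update, which is exactly the drift the rule is meant to prevent.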
🔥Failure Domains & Risk Management
Failure Mode Matrix
| Layer | Failure Mode | Impact | Detection | Mitigation |
|---|---|---|---|---|
| Facility Power | Utility grid failure | Total DC outage | UPS on-battery alert | Generator, UPS, 2N feeds |
| Rack Power | PDU circuit trip | Single rack partial/full loss | PDU alert, server offline | Dual PDU, A+B feeds, 80% rule |
| Cooling | CRAC unit failure | Rising inlet temps, thermal throttle | Temperature sensors, alerts | N+1 CRAC units, hot-aisle containment |
| Cooling | Airflow blockage | Hot spots, localized throttle | Rack inlet sensor spike | Blanking panels, cable management |
| Network | ToR switch failure | Rack network loss | BGP/OSPF neighbor drop | MLAG dual-homing to two ToR switches |
| Network | Spine switch failure | Fabric partial capacity | ECMP path count drop | N+2 spine switches, fast failover |
| Hardware | Disk failure | Data degradation (RAID), potential data loss | SMART alerts, RAID events | RAID 10/6, prompt replacement, spare pool |
| Hardware | PSU failure | Potential server shutdown | IPMI PSU fault alert | Dual PSU on A+B feeds |
| Human | Wrong cable unplugged | Unexpected outage | Network/server alert | Clear labeling, change management, lockout tags |
| Human | Unauthorized access | Security breach, data theft | Access log anomaly, CCTV | Dual-factor, escort policy, audit logs |
Failure Domain Design
A failure domain is the blast radius of a single failure. The goal is to minimize the blast radius by isolating components at appropriate granularity:
- Per-server — dual PSU, redundant NICs
- Per-rack — dual PDUs on separate feeds; dual ToR switches via MLAG
- Per-row — in-row cooling, separate power circuit
- Per-cage — dedicated cage power feeds, separate cooling zone
- Per-datacenter — geographic redundancy, site-to-site replication
Common Pitfall
A/B power redundancy only works if the A and B feeds are truly independent: separate utility feeds, separate UPS systems, separate PDU breakers. If both feeds share a single UPS or a single upstream breaker, you have the appearance of redundancy without the reality.
🧾Operational Checklists
Daily Checks
- Review active alerts (power, temperature, network, hardware) — triage any P2+ alerts
- Verify UPS status — all units on utility, no battery-only conditions
- Check temperature dashboard — no rack inlets above 27°C
- Verify all critical servers are reachable (ping / OOB connectivity)
- Review access logs for unexpected cage entry events
Weekly Checks
- Review power capacity — no circuit above 70% average utilization
- Review network bandwidth utilization — no sustained link above 70%
- Check disk health dashboard — any drives with elevated SMART errors
- Verify cable integrity — no loose or unlabeled cables observed during any site visit
- Review OOB (BMC/IPMI) reachability — all nodes responding
Monthly Checks
- Audit physical inventory vs NetBox — walk the floor, reconcile discrepancies
- Test failover paths — intentionally fail one PDU feed and verify A/B redundancy
- Review and rotate credentials — PDU admin passwords, BMC accounts
- Check spare parts inventory — drives, PSUs, cables, transceivers
- Review access list — remove departed personnel
Quarterly Checks
- Disaster recovery simulation — full tabletop or live failover test
- Firmware updates — server BIOS, BMC/iDRAC, NIC firmware, switch OS
- Capacity planning review — project 6-month power, space, and network growth
- Review and update runbooks — ensure documentation reflects current architecture
- Compliance access review — formal review of all physical and logical access rights
🧠Advanced Concepts
Capacity Planning
Capacity planning is the practice of ensuring your cage has sufficient headroom — in power, cooling, space, and network — to accommodate planned growth without emergency procurement. Key metrics:
- Power budget per rack — track committed vs available kW per circuit; alert when committed reaches 70% of rated capacity
- Network oversubscription ratios: server-facing port bandwidth ÷ leaf uplink bandwidth. Typical: 4:1 (standard compute), 1:1 or 2:1 (latency-sensitive or storage)
- Space (U) utilization — per-rack and per-cage tracking; maintain 30% headroom for emergency rack swaps
- Cooling envelope — calculate actual heat load vs CRAC/CRAH rated capacity; target <70% utilization of cooling capacity
High-Density / AI Racks
GPU and AI accelerator racks (NVIDIA DGX, AMD Instinct clusters) present fundamentally different engineering challenges than standard compute:
Cost Awareness
| Cost Category | Type | Typical Range | Optimization Lever |
|---|---|---|---|
| Colo space (per cabinet/mo) | OPEX | $800–$3000/cabinet | Density optimization, right-sizing |
| Power (per kW/mo committed) | OPEX | $50–$150/kW | Improve PUE, decommission idle servers |
| Cross-connect (per port/mo) | OPEX | $300–$1000/port | Consolidate carriers, use transit IX |
| Server hardware | CAPEX | $3k–$50k/server | Bulk procurement, lease vs buy analysis |
| Network hardware | CAPEX | $5k–$200k/switch | Whitebox alternatives, open networking |
Reliability Engineering
- MTBF (Mean Time Between Failures) — average time between hardware failures. Higher MTBF = more reliable hardware. Use vendor-published MTBF as a planning input, not a guarantee.
- MTTR (Mean Time To Repair/Restore) — average time to restore service after failure. This is the metric you control operationally: good runbooks, spare parts on-site, and practiced procedures reduce MTTR.
- Availability = MTBF ÷ (MTBF + MTTR). A device with 10,000h MTBF and 2h MTTR has 99.98% availability.
- SLOs for infrastructure — define and track SLOs for power availability, network uptime, and provisioning time. Without SLOs, you cannot tell if your infrastructure is improving.
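The availability formula above can be checked directly, and it also shows how much leverage MTTR gives you. A minimal sketch; the second case (halving MTTR to 1 hour) is a hypothetical comparison.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


for mttr in (2.0, 1.0):
    a = availability(10_000, mttr)
    downtime_h = (1 - a) * HOURS_PER_YEAR
    print(f"MTTR={mttr:.0f}h  availability={a:.4%}  downtime={downtime_h:.2f} h/year")
```

Halving MTTR roughly halves expected yearly downtime without touching the hardware, which is the operational lever the MTTR bullet describes.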
Reliability Insight
MTTR matters more than MTBF at scale. In a fleet of 1,000 servers, you will have hardware failures every week regardless of MTBF. The question is not if something fails — it is how fast you detect and recover. Invest in detection and automation proportionally to your fleet size.
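A rough estimate makes the fleet-scale claim concrete. The sketch assumes a constant, independent failure rate per server and a hypothetical per-server MTBF of 100,000 hours; real component failure rates vary with age and workload.

```python
HOURS_PER_WEEK = 7 * 24


def expected_failures_per_week(fleet_size: int, mtbf_hours: float) -> float:
    """Expected failures per week, assuming a constant per-server failure rate."""
    return fleet_size * HOURS_PER_WEEK / mtbf_hours


# The MTBF value is an illustrative assumption, not a vendor figure.
for fleet in (10, 100, 1000):
    rate = expected_failures_per_week(fleet, 100_000)
    print(f"{fleet} servers: {rate:.2f} failures/week")
```

Under these assumptions a 1,000-server fleet sees more than one failure per week even with hardware that would run over a decade between failures on average.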