Learning path

HPC networking from hardware to routed AI fabrics

Browse the full path in public. Sign in to open chapters and labs, sync progress, and participate in the technical discussion.

Part 1 - Foundations

No prerequisites - Ch 0-2

3 chapters

The Hardware Story

Physical layer orientation. What an HCA is, why NICs became DPUs, how a DGX node is wired, the three separate networks.

40 min

Hardware, DGX, NVLink, DPU, Rail topology

Operating Systems and Management Platforms

What runs on every device. How you access it after power-on. The management philosophy. CLI vs orchestrated. First power-on sequence.

35 min

DGX OS, ONYX, Cumulus, UFM, Power-on, First access

Why HPC Networking Is Different

The AllReduce barrier, why TCP fails, tail latency math, and the mental model shift from enterprise to AI networking.

25 min

AllReduce, Lossless, RDMA, JCT, Tail latency
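
As a taste of the tail latency math this chapter refers to, here is a minimal sketch. It assumes each worker independently has a small probability of being slow in a given step, and that a synchronous AllReduce step waits for the slowest worker. The function name and the 0.1% figure are illustrative assumptions, not values from the chapter.

```python
def prob_step_delayed(p_slow: float, workers: int) -> float:
    """Probability that at least one of `workers` independent workers is slow.

    Assumption: a synchronous collective finishes only when the slowest
    worker finishes, so one slow worker delays the whole step.
    """
    return 1.0 - (1.0 - p_slow) ** workers


if __name__ == "__main__":
    # Illustrative per-worker slow probability of 0.1% per step.
    for n in (8, 64, 512):
        print(f"{n:4d} workers: {prob_step_delayed(0.001, n):.4f}")
```

Under these assumptions, an event that is rare on a single link becomes common at cluster scale, which is why tail latency rather than mean latency sets job completion time.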

Part 2 - Fabric Operations

Requires Part 1 - Ch 3-8

6 chapters

The CLI - Reading the Fabric

The commands and discipline for reading HPC fabric state. Which commands run where, how to read their output, and the investigation workflow from physical layer to configuration.

45 min

ibstat, show dcb pfc, ethtool, Diagnostic workflow

InfiniBand Operations - ONYX CLI and Fabric Management

The InfiniBand operations layer: ONYX CLI, error counter interpretation, Subnet Manager management, ibdiagnet fabric sweep, and UFM event correlation.

50 min

ONYX, ibdiagnet, UFM, Subnet Manager, Error counters

PFC, ECN, and Congestion Control

How losslessness actually works: PFC mechanics at the wire level, pause storm formation, ECN CE bit marking, DCQCN rate control algorithm, and the complete RoCEv2 port configuration checklist.

55 min

PFC, ECN, DCQCN, Pause storm, Congestion
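
The sender-side reaction in DCQCN can be sketched roughly as follows. This is a simplified model based on the publicly described algorithm, not the chapter's own material or any vendor implementation. It covers only the rate cut on a congestion notification packet (CNP), the decay of the congestion estimate, and one fast-recovery step; the byte counter and the additive and hyper increase stages are omitted. The class name and the gain `g = 1/256` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Simplified DCQCN reaction point (illustrative, not a vendor default)."""

    rate_gbps: float        # current sending rate
    target_gbps: float      # rate to recover toward
    alpha: float = 1.0      # running estimate of congestion severity
    g: float = 1.0 / 256    # assumed gain for the alpha update

    def on_cnp(self) -> None:
        """A CNP arrived: remember the old rate, cut, and raise alpha."""
        self.target_gbps = self.rate_gbps
        self.rate_gbps *= 1.0 - self.alpha / 2.0
        self.alpha = (1.0 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """No CNP during the timer window: let alpha decay."""
        self.alpha *= 1.0 - self.g

    def on_fast_recovery(self) -> None:
        """Move halfway back toward the target rate."""
        self.rate_gbps = (self.target_gbps + self.rate_gbps) / 2.0


if __name__ == "__main__":
    sender = DcqcnSender(rate_gbps=400.0, target_gbps=400.0)
    sender.on_cnp()
    print("after CNP:", sender.rate_gbps)
    sender.on_fast_recovery()
    print("after one recovery step:", sender.rate_gbps)
```

The point of the sketch is the shape of the control loop: ECN marks at the switch become CNPs at the sender, and the sender cuts in proportion to how congested it believes the path is.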

Efficient Load Balancing

Why AI traffic is structurally low-entropy and how that breaks ECMP. The four load balancing modes (SLB, DLB, GLB, sDLB). Per-packet spraying and RSHP. In-cast congestion patterns and how to diagnose them from spine utilisation counters.

50 min

ECMP, DLB, GLB, RSHP, Flowlets, In-cast, Elephant flows

Topology Design

How AI fabric scales from one switch to a SuperPOD. Fat-tree topology math, bisection bandwidth, BasePOD vs SuperPOD reference designs, oversubscription calculations, ROD vs RUD wiring, switch buffer selection, and cabling constraints.

60 min

Fat-tree, BasePOD, SuperPOD, Oversubscription, ROD, RUD, Cabling
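
The oversubscription and fat-tree arithmetic named above can be sketched as follows. This assumes a plain two-tier leaf/spine design in which every switch has the same port count and each leaf splits its ports evenly between hosts and spines. The function names and example port counts are hypothetical and are not taken from the BasePOD or SuperPOD reference designs.

```python
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Ratio of host-facing to spine-facing bandwidth on one leaf switch.

    1.0 means non-blocking; values above 1.0 mean the uplinks can be
    saturated when all hosts send off-leaf at once.
    """
    return (down_ports * down_gbps) / (up_ports * up_gbps)


def two_tier_max_hosts(radix: int) -> int:
    """Host ports in a non-blocking two-tier fat-tree of `radix`-port switches.

    Each leaf uses radix/2 ports down and radix/2 up, and each spine can
    reach at most `radix` leaves, giving radix * radix / 2 host ports.
    """
    return radix * radix // 2


if __name__ == "__main__":
    # Hypothetical leaf: 48 host ports and 16 uplinks, all at 400 Gb/s.
    print("oversubscription:", oversubscription(48, 400, 16, 400))
    print("max hosts, 64-port switches:", two_tier_max_hosts(64))
```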

NCCL - The Application Layer

How NCCL translates AllReduce into RDMA operations. Ring vs Tree vs Double-Binary Tree algorithms. The environment variables that determine whether NCCL finds RDMA or falls back to TCP. Reading nccl-tests output. Correlating busbw degradation to fabric diagnostics.

55 min

NCCL, AllReduce, busbw, nccl-tests, DCQCN tuning
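
For readers new to the busbw column in nccl-tests output, here is a minimal sketch of how it relates to message size and elapsed time for AllReduce. It follows the correction factor documented by the nccl-tests project: algorithm bandwidth multiplied by 2(n-1)/n. The function name and sample numbers are illustrative assumptions.

```python
def allreduce_busbw_gbs(size_bytes: float, seconds: float, ranks: int) -> float:
    """Bus bandwidth in GB/s for an AllReduce of `size_bytes` across `ranks`.

    algbw is simply bytes / time. busbw scales it by 2*(n-1)/n so the
    figure reflects per-link traffic and can be compared across rank counts.
    """
    algbw = size_bytes / seconds / 1e9
    return algbw * 2 * (ranks - 1) / ranks


if __name__ == "__main__":
    # Illustrative: a 1 GB AllReduce finishing in 10 ms on 8 ranks.
    print(f"{allreduce_busbw_gbs(1e9, 0.010, 8):.1f} GB/s")
```

Because the factor approaches 2 as the rank count grows, busbw should stay roughly flat across scales on a healthy fabric; a drop at larger scale points at the network rather than at the formula.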

Part 3 - Physical Layer and Infrastructure

Requires Part 2 - Ch 9-12

4 chapters

Optics, Cabling, and the Physical Layer

The physical layer beneath the fabric: 400G/800G optics, DSPs, fiber types, form factors, cable selection, and why signal integrity and power density now shape AI cluster design.

40 min

Optics, Cabling, Fiber, OSFP, CPO, Signal integrity

The Storage Fabric

The separate network that feeds and protects training: storage isolation, GDS data paths, NVMe-oF transports, parallel file systems, checkpoint economics, and storage topology choices.

45 min

Storage, GDS, NVMe-oF, Parallel file systems, Checkpointing

Monitoring, Telemetry, and Observability

Know about problems before the ML engineer's Slack message arrives. UFM REST API, DCGM GPU metrics, Prometheus alert design, threshold calibration, and cross-layer correlation across four monitoring streams.

48 min

UFM API, DCGM, Prometheus, Grafana, Alert calibration, Correlation

GPU Hardware Generations

Network-relevant implications of GPU generations: NVLink/NVSwitch generation table, SXM vs PCIe form factors, GH200, H100 CNX, and Confidential Computing.

55 min

NVLink, SXM, PCIe, GH200, MIG

Part 4 - Scale and Architecture

Requires Part 3 - Ch 13-21

9 chapters

Scale-Up Networking - NVLink Switch System

External NVLink Switch modules, 57.6 TB/s all-to-all at 256 GPUs, NVLink Network addressing, scale-up vs scale-out architecture decisions, and NVLink Switch diagnostics.

45 min

NVLink Switch, 57.6 TB/s, Scale-up, NVLink Network, SHARP

Alternative Topologies

Torus, folded torus, dragonfly, and TPU Pod design choices - where they came from, what workloads they suit, and why fat-tree remains dominant for AI training clusters.

45 min

Torus, Dragonfly, Google TPU, Fat-tree

IP Routing for AI/ML Fabrics

How modern AI fabrics use routed Ethernet: BGP unnumbered, ASN design, BGP DPF, RIFT comparisons, Flex Algo, SRv6 path steering, and multi-tenant EVPN-VXLAN design.

55 min

BGP, RIFT, Flex Algo, SRv6, Multi-tenancy

The GPU Compute Network - Packet Anatomy

A packet-level walkthrough from NCCL work queue entries to remote DMA completion: DGX interfaces, Queue Pair mechanics, ConnectX-7 processing, switch forwarding, and end-to-end packet decode.

55 min

Packet anatomy, Queue Pairs, ConnectX-7, Leaf/spine forwarding, Wireshark

Storage Network Packet Path

A packet-level walkthrough of a checkpoint write from GPU HBM to storage appliance: NVMe-oF capsules, DMA paths, storage-fabric behavior, frame anatomy, and diagnostics.

55 min

NVMe-oF, GDS, Packet anatomy, Storage fabric, Checkpointing

OOB and Management Network

The out-of-band management fabric: BMC architecture, IPMI and Redfish internals, OOB topology, switch management isolation, UFM communication paths, BlueField-3 management architecture, and hardening guidance.

60 min

OOB, BMC, IPMI, Redfish, UFM, Management network

IP Addressing and Planning

A complete addressing reference for DGX BasePOD and SuperPOD deployments: address families, RFC 1918 partitioning, loopback design, P2P links, /32 server routes, management-plane planning, VXLAN VNI allocation, and scaling pitfalls.

65 min

IP addressing, Loopbacks, RFC1918, BGP unnumbered, VXLAN, SuperPOD
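
To make the P2P link planning concrete, here is a small sketch that carves /31 point-to-point subnets out of an RFC 1918 block using Python's standard `ipaddress` module. The block `10.255.0.0/24` and the function name are hypothetical examples, not values from the BasePOD or SuperPOD reference designs.

```python
import ipaddress
from itertools import islice


def p2p_links(block: str, count: int) -> list[tuple[str, str]]:
    """Return the first `count` /31 links in `block` as (side_a, side_b) pairs."""
    network = ipaddress.ip_network(block)
    links = []
    for subnet in islice(network.subnets(new_prefix=31), count):
        # A /31 holds exactly two addresses, one per end of the link.
        links.append((str(subnet[0]), str(subnet[1])))
    return links


if __name__ == "__main__":
    # Hypothetical block reserved for leaf-to-spine links.
    for side_a, side_b in p2p_links("10.255.0.0/24", 4):
        print(f"{side_a} <-> {side_b}")
```

A /24 yields 128 such links, so sizing the block against the planned leaf-to-spine link count early avoids renumbering later.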

Ultra Ethernet Consortium (UEC)

Why RoCEv2 has friction at scale, and how UEC addresses it: UET packet format, SACK-based reliability without lossless fabric, 1-RTT congestion feedback, native multipath spraying, switch requirements, and honest deployment readiness as of March 2026.

60 min

UEC, UET, PFC-free, SACK retransmit, Packet spraying, NPM

Congestion Control Deep Dive

A rigorous, algorithm-level treatment of every congestion control scheme used in production AI fabrics (DCQCN, Swift, HPCC, TIMELY, and UEC CC), with practical guidance on parameter tuning and algorithm selection.

60 min

DCQCN, Swift, HPCC, TIMELY, UEC CC, Congestion Control, JCT

Part 5 - Advanced Networking

Requires Part 4 - Ch 22-23

2 chapters

Segment Routing for AI Fabrics

A practitioner's guide to deploying SRv6, SR-TE, EVPN+SRv6, and IS-IS Flex-Algo in production AI data centre fabrics.

65 min

SRv6, SR-TE, EVPN, Flex-Algo, Traffic engineering

AI Networking Security

The complete security layer for AI fabrics: RDMA threat model, RoCEv2 RKEY protection, Spectrum-X GBP microsegmentation, InfiniBand PKey isolation, BlueField-3 as a security enforcement point, and UFM Cyber-AI anomaly detection.

65 min

RDMA security, GID filtering, Spectrum-X, PKeys, UFM Cyber-AI

Part 6 - Platform Integration

Requires Part 5 - Ch 24+

4 chapters

Spectrum-X Architecture and the AI Factory Platform

NVIDIA Spectrum-X: Spectrum-4 ASIC, BlueField-3 SuperNIC, DOCA, NetQ, and the vertically integrated Ethernet platform behind AI factory fabrics.

45 min

Spectrum-X, Spectrum-4, BlueField-3, NVUE, DOCA, NetQ, AI factory, Scalable Unit

RoCE Configuration and Operations on Spectrum-X

Configure RoCEv2 end to end on Spectrum-X: prerequisites, NVUE Day-0 workflow, QoS architecture, ECN/PFC tuning, and production verification.

55 min

RoCEv2, Spectrum-X, NVUE, PFC, ECN, DCQCN, QoS

Adaptive Routing and Per-Packet Spraying on Spectrum-X

Why flow-based ECMP fails AI collectives, how Spectrum-4 Adaptive Routing reacts to queue depth, and how BF3 reorder buffers make per-packet spraying viable on Spectrum-X.

60 min

Adaptive Routing, Spectrum-4, BlueField-3, Per-packet AR, Resilient hashing, Telemetry

BGP-EVPN Multi-Tenancy on Spectrum-X

How Spectrum-X enforces tenant isolation using BGP-EVPN, VXLAN, and GBP microsegmentation - from VNI planning to route-target configuration and operational troubleshooting.

60 min

BGP-EVPN, VXLAN, Spectrum-X, Multi-tenancy, GBP, Tenant isolation

Labs

Scenario-based simulator work. Sign in to launch a lab and keep your troubleshooting progress attached to one account.

Fabric CLI orientation

A no-incident calibration lab. Learn how UFM, DGX OS, Linux netdev names, and Cumulus NVUE each describe the same two-rail fabric before you start troubleshooting failures.

12 min

CLI, UFM, DGX OS, Cumulus, Rail mapping

Read RoCE lossless counters

A healthy-link counter drill. Run a short RDMA write probe, then correlate switch-side PFC and RoCE counters with DGX ethtool counters so ECN marks, PFC pauses, and zero-drop behavior become familiar before incident labs.

15 min

PFC, ECN, RoCE counters, ib_write_bw

Identify the failed rail

A GPU rail has gone dark. Use the topology map and CLI tools to identify which rail, confirm the RDMA state, and isolate whether the fault is on the NIC or switch side.

12 min

CLI, Physical diagnostics, Rail state

Fix the PFC misconfiguration

A RoCEv2 workload is experiencing retransmissions. PFC is misconfigured. Diagnose and fix it.

10 min

PFC, RoCEv2, CLI

Diagnose fabric congestion

Throughput has dropped 40%. Investigate using interface counters and ECN configuration commands.

15 min

Congestion, ECN, Counters

Diagnose uneven spine utilisation

AllReduce throughput has fallen, yet no packet drops are visible. Diagnose why spine links are unevenly loaded and fix load balancing to restore full training throughput.

15 min

ECMP, Load balancing, Spine counters

Evaluate topology proposals

Two vendors propose different switch configurations for a 64-node DGX H100 cluster. Calculate oversubscription ratios, identify which proposal meets requirements, and submit a recommendation before the purchase order is signed.

15 min

Oversubscription, Capacity planning, Topology

Diagnose NCCL transport fallback

A 16-node cluster shows 3 GB/s busbw, well below the expected performance. All hardware is healthy. Diagnose why NCCL is using socket transport and fix the environment variable misconfiguration.

20 min

NCCL, Transport fallback, Env vars

Triage a silent fabric degradation

Training is 12% slower - but no hard errors anywhere. Use UFM port counters, DCGM GPU metrics, and switch counters to correlate a rising pre-FEC BER across three monitoring layers and identify the marginal physical connector before it becomes a full link failure.

25 min

Monitoring, DCGM, UFM, Telemetry

Uncover the hidden pause storm

A switch looks healthy at a glance, but the NIC reveals a severe pause storm. Check both ends, identify missing ECN, and restore rate control before continuous PFC pauses collapse throughput.

15 min

PFC, Pause storm, ECN

Fix the PFC priority mismatch

PFC is enabled, but on the wrong traffic class. Cross-check NIC drops, PFC priority output, and RoCE DSCP-to-priority mapping, then move PFC protection back to priority 3.

18 min

PFC, Priority mapping, RoCE

Recover the err-disabled rail

A rail has gone err-disabled after a physical fault. Recognise the NIC-side active-state trap, confirm the switch port failure, replace the optic, clear err-disable, and verify full recovery.

20 min

Optics, Err-disable, Rail recovery

ECMP hotspot: BGP bandwidth community

A reduced-capacity spine is still receiving equal ECMP traffic. Identify the missing BGP Link Bandwidth community and restore weighted ECMP before PFC storms spread.

22 min

BGP, ECMP, Weighted paths

BGP suboptimal routing: spine ASN design

A link failure triggers a bad 3-hop path because the spines use different ASNs. Trace the suboptimal route, unify the spine ASN design, and verify clean failover behavior.

24 min

BGP, Failover, ASN design