Learn the fabric that keeps AI clusters alive.
InfiniBand. RoCEv2. RDMA. Congestion control. Scale-out fabric design. FabricLab teaches AI and HPC networking through interactive chapters, stateful labs, and a simulator built for network engineers.
28
published chapters
21
scenario labs
Free
public access model
The gap
There is no Packet Tracer for HPC networking.
Network engineers who can troubleshoot BGP, reason about ECMP, and design VXLAN fabrics still walk into AI data centers and find an unfamiliar world. The knowledge is fragmented across vendor docs, conference talks, and incident writeups.
FabricLab turns that scattered knowledge into a structured, open, community-reviewed platform. Chapters explain the hardware and protocols. Labs let you test commands against live state. Anyone can contribute a correction, a new lab, or a sharper explanation.
21
scenario labs available in the simulator catalog
28
chapters currently published in the open catalog
Curriculum
A structured path from hardware to protocol.
28 chapters. 21 scenario labs. One simulator. All chapters are free to read. Sign in when you want synced progress and discussion.
Chapter 0
The Hardware Story
Physical layer orientation. What an HCA is, why NICs became DPUs, how a DGX node is wired, the three separate networks.
Chapter 1
Operating Systems and Management Platforms
What runs on every device. How you access it after power-on. The management philosophy. CLI vs orchestrated. First power-on sequence.
Chapter 2
Why HPC Networking Is Different
The AllReduce barrier, why TCP fails, tail latency math, and the mental model shift from enterprise to AI networking.
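A quick sketch of the tail latency math this chapter covers: a synchronous AllReduce step finishes only when its slowest flow finishes, so small per-flow tail probabilities compound at scale. The flow counts and probabilities below are illustrative assumptions, not measurements.

```python
# Tail latency math sketch: a collective step is gated by the slowest
# flow, so per-flow tail probabilities compound with scale.

def p_step_delayed(n_flows: int, p_tail: float) -> float:
    """Probability that at least one of n_flows hits its latency tail."""
    return 1 - (1 - p_tail) ** n_flows

# A 1% per-flow tail is a nuisance at small scale...
print(f"{p_step_delayed(8, 0.01):.3f}")     # ~0.077
# ...but with 1,000 synchronized flows, nearly every step is delayed.
print(f"{p_step_delayed(1000, 0.01):.5f}")  # ~0.99996
```

This is why tail latency, not average latency, dominates the enterprise-to-AI mental model shift.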
Chapter 3
The CLI - Reading the Fabric
The commands and discipline for reading HPC fabric state. Which commands run where, how to read their output, and the investigation workflow from physical layer to configuration.
Chapter 4
InfiniBand Operations - ONYX CLI and Fabric Management
The InfiniBand operations layer: ONYX CLI, error counter interpretation, Subnet Manager management, ibdiagnet fabric sweep, and UFM event correlation.
Chapter 5
PFC, ECN, and Congestion Control
How losslessness actually works: PFC mechanics at the wire level, pause storm formation, ECN CE bit marking, DCQCN rate control algorithm, and the complete RoCEv2 port configuration checklist.
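To make the ECN side concrete, here is a minimal sketch of the RED-style marking curve a switch applies when setting the CE bit that drives DCQCN rate control: no marking below a low queue threshold, a linear ramp between the two thresholds, and marking everything above the high threshold. The threshold and probability values are hypothetical, not vendor defaults.

```python
def ecn_mark_prob(queue_kb: float, kmin: float, kmax: float, pmax: float) -> float:
    """RED-style ECN CE marking probability:
    0 below kmin, linear ramp up to pmax between kmin and kmax,
    mark every packet once the queue exceeds kmax."""
    if queue_kb <= kmin:
        return 0.0
    if queue_kb >= kmax:
        return 1.0
    return pmax * (queue_kb - kmin) / (kmax - kmin)

# Illustrative thresholds: ramp between 100 KB and 300 KB, pmax = 20%.
print(ecn_mark_prob(150, kmin=100, kmax=300, pmax=0.2))  # 0.05
```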
Chapter 6
Efficient Load Balancing
Why AI traffic is structurally low-entropy and how that breaks ECMP. The four load balancing modes (SLB, DLB, GLB, sDLB). Per-packet spraying and RSHP. Incast congestion patterns and how to diagnose them from spine utilisation counters.
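A small simulation of why low-entropy traffic defeats hash-based ECMP: with only a handful of elephant flows, random hash placement routinely leaves some uplinks idle while others carry multiple flows. The flow and link counts here are illustrative assumptions.

```python
import random

def ecmp_idle_fraction(n_flows: int, n_links: int, trials: int = 10000) -> float:
    """Average fraction of uplinks left completely idle when n_flows
    are hash-pinned to n_links (each flow lands on exactly one link)."""
    idle = 0
    for _ in range(trials):
        used = {random.randrange(n_links) for _ in range(n_flows)}
        idle += n_links - len(used)
    return idle / (trials * n_links)

random.seed(0)
# 8 elephant flows over 8 uplinks: on average roughly a third of the
# links sit idle while the colliding links congest.
print(f"{ecmp_idle_fraction(8, 8):.2f}")
```

The expected idle fraction is (1 - 1/n)^n, about 34% for 8 flows on 8 links, which is the collision problem the DLB/GLB modes and per-packet spraying exist to solve.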
Chapter 7
Topology Design
How AI fabric scales from one switch to a SuperPOD. Fat-tree topology math, bisection bandwidth, BasePOD vs SuperPOD reference designs, oversubscription calculations, ROD vs RUD wiring, switch buffer selection, and cabling constraints.
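As a worked instance of the fat-tree math, assuming the classic non-blocking 3-tier k-ary design built entirely from k-port switches (real BasePOD/SuperPOD reference designs differ in detail):

```python
def fat_tree(k: int) -> tuple[int, int]:
    """Host and switch counts for a classic 3-tier k-ary fat-tree
    built from k-port switches: k pods, k/2 edge and k/2 aggregation
    switches per pod, (k/2)^2 core switches, k^3/4 hosts."""
    hosts = k ** 3 // 4
    edge = agg = k * k // 2
    core = (k // 2) ** 2
    return hosts, edge + agg + core

hosts, switches = fat_tree(64)  # built from 64-port switches
print(hosts, switches)          # 65536 hosts, 5120 switches
```

The same arithmetic underlies oversubscription: a leaf with more host-facing ports than uplink bandwidth trades that 1:1 ratio for cost.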
Chapter 8
NCCL - The Application Layer
How NCCL translates AllReduce into RDMA operations. Ring vs Tree vs Double-Binary Tree algorithms. The environment variables that determine whether NCCL finds RDMA or falls back to TCP. Reading nccl-tests output. Correlating busbw degradation to fabric diagnostics.
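The busbw figure that nccl-tests reports can be reproduced by hand: for AllReduce it scales the raw algorithm bandwidth (bytes moved over elapsed time) by 2(n-1)/n to reflect the data each rank actually puts on the wire in a ring. The message size and timing below are illustrative assumptions.

```python
def allreduce_busbw(size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth (GB/s) as nccl-tests derives it for AllReduce:
    algbw = size / time, busbw = algbw * 2 * (n - 1) / n."""
    algbw = size_bytes / time_s / 1e9
    return algbw * 2 * (n_ranks - 1) / n_ranks

# A 1 GiB AllReduce across 8 GPUs finishing in 6 ms:
print(f"{allreduce_busbw(2**30, 6e-3, 8):.1f} GB/s")  # ~313 GB/s
```

When busbw degrades below the link-rate ceiling, that gap is the signal to correlate with the fabric diagnostics covered in earlier chapters.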
Chapter 9
Optics, Cabling, and the Physical Layer
The physical layer beneath the fabric: 400G/800G optics, DSPs, fiber types, form factors, cable selection, and why signal integrity and power density now shape AI cluster design.
Chapter 10
The Storage Fabric
The separate network that feeds and protects training: storage isolation, GDS data paths, NVMe-oF transports, parallel file systems, checkpoint economics, and storage topology choices.
Chapter 11
Monitoring, Telemetry, and Observability
Know about problems before the ML engineer's Slack message arrives. UFM REST API, DCGM GPU metrics, Prometheus alert design, threshold calibration, and cross-layer correlation across four monitoring streams.
Chapter 12
Scale-Up Networking - NVLink Switch System
External NVLink Switch modules, 57.6 TB/s all-to-all at 256 GPUs, NVLink Network addressing, scale-up vs scale-out architecture decisions, and NVLink Switch diagnostics.
Chapter 13
Alternative Topologies
Torus, folded torus, dragonfly, and TPU Pod design choices - where they came from, what workloads they suit, and why fat-tree remains dominant for AI training clusters.
Chapter 14
GPU Hardware Generations
Network-relevant implications of GPU generations: NVLink/NVSwitch generation table, SXM vs PCIe form factors, GH200, H100 CNX, and Confidential Computing.
Chapter 15
IP Routing for AI/ML Fabrics
How modern AI fabrics use routed Ethernet: BGP unnumbered, ASN design, BGP DPF, RIFT comparisons, Flex Algo, SRv6 path steering, and multi-tenant EVPN-VXLAN design.
Chapter 16
The GPU Compute Network - Packet Anatomy
A packet-level walkthrough from NCCL work queue entries to remote DMA completion: DGX interfaces, Queue Pair mechanics, ConnectX-7 processing, switch forwarding, and end-to-end packet decode.
Chapter 17
Storage Network Packet Path
A packet-level walkthrough of a checkpoint write from GPU HBM to storage appliance: NVMe-oF capsules, DMA paths, storage-fabric behavior, frame anatomy, and diagnostics.
Chapter 18
OOB and Management Network
The out-of-band management fabric: BMC architecture, IPMI and Redfish internals, OOB topology, switch management isolation, UFM communication paths, BlueField-3 management architecture, and hardening guidance.
Chapter 19
IP Addressing and Planning
A complete addressing reference for DGX BasePOD and SuperPOD deployments: address families, RFC 1918 partitioning, loopback design, P2P links, /32 server routes, management-plane planning, VXLAN VNI allocation, and scaling pitfalls.
Chapter 20
Ultra Ethernet Consortium (UEC)
Why RoCEv2 has friction at scale, and how UEC addresses it: UET packet format, SACK-based reliability without lossless fabric, 1-RTT congestion feedback, native multipath spraying, switch requirements, and honest deployment readiness as of March 2026.
Chapter 21
Congestion Control Deep Dive
A rigorous, algorithm-level treatment of every congestion control scheme used in production AI fabrics — DCQCN, Swift, HPCC, TIMELY, and UEC CC — with practical guidance on parameter tuning and algorithm selection.
Chapter 22
Segment Routing for AI Fabrics
A practitioner's guide to deploying SRv6, SR-TE, EVPN+SRv6, and IS-IS Flex-Algo in production AI data centre fabrics.
Chapter 23
AI Networking Security
The complete security layer for AI fabrics: RDMA threat model, RoCEv2 RKEY protection, Spectrum-X GBP microsegmentation, InfiniBand PKey isolation, BlueField-3 as a security enforcement point, and UFM Cyber-AI anomaly detection.
Chapter 24
Spectrum-X Architecture and the AI Factory Platform
NVIDIA Spectrum-X: Spectrum-4 ASIC, BlueField-3 SuperNIC, DOCA, NetQ, and the vertically integrated Ethernet platform behind AI factory fabrics.
Chapter 25
RoCE Configuration and Operations on Spectrum-X
Configure RoCEv2 end to end on Spectrum-X: prerequisites, NVUE Day-0 workflow, QoS architecture, ECN/PFC tuning, and production verification.
Chapter 26
Adaptive Routing and Per-Packet Spraying on Spectrum-X
Why flow-based ECMP fails AI collectives, how Spectrum-4 Adaptive Routing reacts to queue depth, and how BlueField-3 reorder buffers make per-packet spraying viable on Spectrum-X.
Chapter 27
BGP-EVPN Multi-Tenancy on Spectrum-X
How Spectrum-X enforces tenant isolation using BGP-EVPN, VXLAN, and GBP microsegmentation - from VNI planning to route-target configuration and operational troubleshooting.
How it works
Three systems. One learning loop.
Structured chapters
Connect hardware, transport behavior, congestion control, and operator workflow in one guided read.
Stateful CLI labs
Commands read live lab state, so the simulator reacts to the exact fault you are tracing.
Community feedback
Leave chapter notes, lab corrections, and platform issues right where the technical context lives.
Who this is for
Built by a network engineer, for network engineers.
CCNP / CCIE engineers
You can read BGP tables and design VXLAN fabrics. You have not spent time inside InfiniBand or RoCE yet. FabricLab is the transition path.
HPC cluster administrators
You manage the servers, but the fabric still feels opaque. FabricLab closes the gap between compute operations and network operations.
Cloud and platform architects
You are designing GPU infrastructure and need to understand what lossless fabrics demand at the protocol and operational level.
Network engineers growing into AI infrastructure
AI fabrics are a fast-moving specialisation. FabricLab gives you a structured path before the first production incident lands on your desk.
Community
Open source. Community reviewed. Always free.
Comment where the issue appears
Leave technical corrections, lab glitches, and operator notes directly on the relevant chapter or lab page.
Contribute through the repo
Fix a chapter, add a lab, or improve a visualisation through focused pull requests on GitHub.
g-arjuna/fabriclab
Free to read, forever
Every chapter and lab is free to access. Sign in only when you want synced progress or discussion tools.
Start learning
Learn the fabric in the open.
Every chapter is free to read. Sign in to track your progress, join chapter discussions, and keep your learning history in one place.