Learn the fabric that keeps AI clusters alive.
InfiniBand. RoCEv2. RDMA. Congestion control. Scale-out fabric design. FabricLab teaches AI and HPC networking through interactive chapters, stateful labs, and a simulator built for network engineers.
28
published chapters
21
scenario labs
Free
public access model
The gap
There is no Packet Tracer for HPC networking.
Network engineers who can troubleshoot BGP, reason about ECMP, and design VXLAN fabrics still walk into AI data centers and find an unfamiliar world. The knowledge is fragmented across vendor docs, conference talks, and incident writeups.
FabricLab turns that scattered knowledge into a structured, open, community-reviewed platform. Chapters explain the hardware and protocols. Labs let you test commands against live state. Anyone can contribute a correction, a new lab, or a sharper explanation.
21
scenario labs available in the simulator catalog
28
chapters currently published in the open catalog
Curriculum
A structured path from hardware to protocol.
28 chapters. 21 scenario labs. One simulator. All chapters are free to read. Sign in when you want synced progress and discussion.
Chapter 0
The Hardware Story
Physical layer orientation. What an HCA is, why NICs became DPUs, how a DGX node is wired, the three separate networks.
Chapter 1
Operating Systems and Management Platforms
What runs on every device. How you access it after power-on. The management philosophy. CLI vs orchestrated. First power-on sequence.
Chapter 2
Why HPC Networking Is Different
The AllReduce barrier, why TCP fails, tail latency math, and the mental model shift from enterprise to AI networking.
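A quick sketch of the tail latency math this chapter covers: a synchronous AllReduce step finishes only when its slowest flow finishes, so small per-flow tail probabilities compound at scale. The flow counts and probabilities below are illustrative assumptions, not measurements.

```python
# Tail latency math sketch: a collective step is gated by the slowest
# flow, so per-flow tail probabilities compound with scale.

def p_step_delayed(n_flows: int, p_tail: float) -> float:
    """Probability that at least one of n_flows hits its latency tail."""
    return 1 - (1 - p_tail) ** n_flows

# A 1% per-flow tail is a nuisance at small scale...
print(f"{p_step_delayed(8, 0.01):.3f}")     # ~0.077
# ...but with 1,000 synchronized flows, nearly every step is delayed.
print(f"{p_step_delayed(1000, 0.01):.5f}")  # ~0.99996
```

This is why tail latency, not average latency, dominates the enterprise-to-AI mental model shift.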
Chapter 3
The CLI - Reading the Fabric
The commands and discipline for reading HPC fabric state. Which commands run where, how to read their output, and the investigation workflow from physical layer to configuration.
Chapter 4
InfiniBand Operations - ONYX CLI and Fabric Management
The InfiniBand operations layer: ONYX CLI, error counter interpretation, Subnet Manager management, ibdiagnet fabric sweep, and UFM event correlation.
Chapter 5
PFC, ECN, and Congestion Control
How losslessness actually works: PFC mechanics at the wire level, pause storm formation, ECN CE bit marking, DCQCN rate control algorithm, and the complete RoCEv2 port configuration checklist.
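To make the ECN side concrete, here is a minimal sketch of the RED-style marking curve a switch applies when setting the CE bit that drives DCQCN rate control: no marking below a low queue threshold, a linear ramp between the two thresholds, and marking everything above the high threshold. The threshold and probability values are hypothetical, not vendor defaults.

```python
def ecn_mark_prob(queue_kb: float, kmin: float, kmax: float, pmax: float) -> float:
    """RED-style ECN CE marking probability:
    0 below kmin, linear ramp up to pmax between kmin and kmax,
    mark every packet once the queue exceeds kmax."""
    if queue_kb <= kmin:
        return 0.0
    if queue_kb >= kmax:
        return 1.0
    return pmax * (queue_kb - kmin) / (kmax - kmin)

# Illustrative thresholds: ramp between 100 KB and 300 KB, pmax = 20%.
print(ecn_mark_prob(150, kmin=100, kmax=300, pmax=0.2))  # 0.05
```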
Chapter 6
Efficient Load Balancing
Why AI traffic is structurally low-entropy and how that breaks ECMP. The four load balancing modes (SLB, DLB, GLB, sDLB). Per-packet spraying and RSHP. Incast congestion patterns and how to diagnose them from spine utilisation counters.
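A small simulation of why low-entropy traffic defeats hash-based ECMP: with only a handful of elephant flows, random hash placement routinely leaves some uplinks idle while others carry multiple flows. The flow and link counts here are illustrative assumptions.

```python
import random

def ecmp_idle_fraction(n_flows: int, n_links: int, trials: int = 10000) -> float:
    """Average fraction of uplinks left completely idle when n_flows
    are hash-pinned to n_links (each flow lands on exactly one link)."""
    idle = 0
    for _ in range(trials):
        used = {random.randrange(n_links) for _ in range(n_flows)}
        idle += n_links - len(used)
    return idle / (trials * n_links)

random.seed(0)
# 8 elephant flows over 8 uplinks: on average roughly a third of the
# links sit idle while the colliding links congest.
print(f"{ecmp_idle_fraction(8, 8):.2f}")
```

The expected idle fraction is (1 - 1/n)^n, about 34% for 8 flows on 8 links, which is the collision problem the DLB/GLB modes and per-packet spraying exist to solve.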
Chapter 7
Topology Design
How AI fabric scales from one switch to a SuperPOD. Fat-tree topology math, bisection bandwidth, BasePOD vs SuperPOD reference designs, oversubscription calculations, ROD vs RUD wiring, switch buffer selection, and cabling constraints.
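As a worked instance of the fat-tree math, assuming the classic non-blocking 3-tier k-ary design built entirely from k-port switches (real BasePOD/SuperPOD reference designs differ in detail):

```python
def fat_tree(k: int) -> tuple[int, int]:
    """Host and switch counts for a classic 3-tier k-ary fat-tree
    built from k-port switches: k pods, k/2 edge and k/2 aggregation
    switches per pod, (k/2)^2 core switches, k^3/4 hosts."""
    hosts = k ** 3 // 4
    edge = agg = k * k // 2
    core = (k // 2) ** 2
    return hosts, edge + agg + core

hosts, switches = fat_tree(64)  # built from 64-port switches
print(hosts, switches)          # 65536 hosts, 5120 switches
```

The same arithmetic underlies oversubscription: a leaf with more host-facing ports than uplink bandwidth trades that 1:1 ratio for cost.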
Chapter 8
NCCL - The Application Layer
How NCCL translates AllReduce into RDMA operations. Ring vs Tree vs Double-Binary Tree algorithms. The environment variables that determine whether NCCL finds RDMA or falls back to TCP. Reading nccl-tests output. Correlating busbw degradation to fabric diagnostics.
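The busbw figure that nccl-tests reports can be reproduced by hand: for AllReduce it scales the raw algorithm bandwidth (bytes moved over elapsed time) by 2(n-1)/n to reflect the data each rank actually puts on the wire in a ring. The message size and timing below are illustrative assumptions.

```python
def allreduce_busbw(size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth (GB/s) as nccl-tests derives it for AllReduce:
    algbw = size / time, busbw = algbw * 2 * (n - 1) / n."""
    algbw = size_bytes / time_s / 1e9
    return algbw * 2 * (n_ranks - 1) / n_ranks

# A 1 GiB AllReduce across 8 GPUs finishing in 6 ms:
print(f"{allreduce_busbw(2**30, 6e-3, 8):.1f} GB/s")  # ~313 GB/s
```

When busbw degrades below the link-rate ceiling, that gap is the signal to correlate with the fabric diagnostics covered in earlier chapters.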
Chapter 9
Optics, Cabling, and the Physical Layer
The physical layer beneath the fabric: 400G/800G optics, DSPs, fiber types, form factors, cable selection, and why signal integrity and power density now shape AI cluster design.
Chapter 10
The Storage Fabric
The separate network that feeds and protects training: storage isolation, GDS data paths, NVMe-oF transports, parallel file systems, checkpoint economics, and storage topology choices.
Chapter 11
Monitoring, Telemetry, and Observability
Know about problems before the ML engineer's Slack message arrives. UFM REST API, DCGM GPU metrics, Prometheus alert design, threshold calibration, and cross-layer correlation across four monitoring streams.
Chapter 12
Scale-Up Networking - NVLink Switch System
External NVLink Switch modules, 57.6 TB/s all-to-all at 256 GPUs, NVLink Network addressing, scale-up vs scale-out architecture decisions, and NVLink Switch diagnostics.
Chapter 13
Alternative Topologies
Torus, folded torus, dragonfly, and TPU Pod design choices - where they came from, what workloads they suit, and why fat-tree remains dominant for AI training clusters.
Chapter 14
GPU Hardware Generations
Network-relevant implications of GPU generations: NVLink/NVSwitch generation table, SXM vs PCIe form factors, GH200, H100 CNX, and Confidential Computing.
Chapter 15
IP Routing for AI/ML Fabrics
How modern AI fabrics use routed Ethernet: BGP unnumbered, ASN design, BGP DPF, RIFT comparisons, Flex Algo, SRv6 path steering, and multi-tenant EVPN-VXLAN design.
Chapter 16
The GPU Compute Network - Packet Anatomy
A packet-level walkthrough from NCCL work queue entries to remote DMA completion: DGX interfaces, Queue Pair mechanics, ConnectX-7 processing, switch forwarding, and end-to-end packet decode.
Chapter 17
Storage Network Packet Path
A packet-level walkthrough of a checkpoint write from GPU HBM to storage appliance: NVMe-oF capsules, DMA paths, storage-fabric behavior, frame anatomy, and diagnostics.
Chapter 18
OOB and Management Network
The out-of-band management fabric: BMC architecture, IPMI and Redfish internals, OOB topology, switch management isolation, UFM communication paths, BlueField-3 management architecture, and hardening guidance.
Chapter 19
IP Addressing and Planning
A complete addressing reference for DGX BasePOD and SuperPOD deployments: address families, RFC 1918 partitioning, loopback design, P2P links, /32 server routes, management-plane planning, VXLAN VNI allocation, and scaling pitfalls.
Chapter 20
Ultra Ethernet Consortium (UEC)
Why RoCEv2 has friction at scale, and how UEC addresses it: UET packet format, SACK-based reliability without lossless fabric, 1-RTT congestion feedback, native multipath spraying, switch requirements, and honest deployment readiness as of March 2026.
Chapter 21
Congestion Control Deep Dive
A rigorous, algorithm-level treatment of every congestion control scheme used in production AI fabrics — DCQCN, Swift, HPCC, TIMELY, and UEC CC — with practical guidance on parameter tuning and algorithm selection.
Chapter 22
Segment Routing for AI Fabrics
A practitioner's guide to deploying SRv6, SR-TE, EVPN+SRv6, and IS-IS Flex-Algo in production AI data centre fabrics.
Chapter 23
AI Networking Security
The complete security layer for AI fabrics: RDMA threat model, RoCEv2 RKEY protection, Spectrum-X GBP microsegmentation, InfiniBand PKey isolation, BlueField-3 as a security enforcement point, and UFM Cyber-AI anomaly detection.
Chapter 24
Spectrum-X Architecture and the AI Factory Platform
NVIDIA Spectrum-X: Spectrum-4 ASIC, BlueField-3 SuperNIC, DOCA, NetQ, and the vertically integrated Ethernet platform behind AI factory fabrics.
Chapter 25
RoCE Configuration and Operations on Spectrum-X
Configure RoCEv2 end to end on Spectrum-X: prerequisites, NVUE Day-0 workflow, QoS architecture, ECN/PFC tuning, and production verification.
Chapter 26
Adaptive Routing and Per-Packet Spraying on Spectrum-X
Why flow-based ECMP fails AI collectives, how Spectrum-4 Adaptive Routing reacts to queue depth, and how BlueField-3 reorder buffers make per-packet spraying viable on Spectrum-X.
Chapter 27
BGP-EVPN Multi-Tenancy on Spectrum-X
How Spectrum-X enforces tenant isolation using BGP-EVPN, VXLAN, and GBP microsegmentation - from VNI planning to route-target configuration and operational troubleshooting.
How it works
Three systems. One learning loop.
Structured chapters
Connect hardware, transport behavior, congestion control, and operator workflow in one guided read.
Stateful CLI labs
Commands read live lab state, so the simulator reacts to the exact fault you are tracing.
Community feedback
Leave chapter notes, lab corrections, and platform issues right where the technical context lives.
Who this is for
Built by a network engineer, for network engineers.
CCNP / CCIE engineers
You can read BGP tables and design VXLAN fabrics. You have not spent time inside InfiniBand or RoCE yet. FabricLab is the transition path.
HPC cluster administrators
You manage the servers, but the fabric still feels opaque. FabricLab closes the gap between compute operations and network operations.
Cloud and platform architects
You are designing GPU infrastructure and need to understand what lossless fabrics demand at the protocol and operational level.
Network engineers growing into AI infrastructure
AI fabrics are a fast-moving specialisation. FabricLab gives you a structured path before the first production incident lands on your desk.
Community
Open source. Community reviewed. Always free.
Comment where the issue appears
Leave technical corrections, lab glitches, and operator notes directly on the relevant chapter or lab page.
Contribute through the repo
Fix a chapter, add a lab, or improve a visualisation through focused pull requests on GitHub.
g-arjuna/fabriclab
Free to read, forever
Every chapter and lab is free to access. Sign in only when you want synced progress or discussion tools.
Start learning
Learn the fabric in the open.
Every chapter is free to read. Sign in to track your progress, join chapter discussions, and keep your learning history in one place.