Fabric CLI orientation
A no-incident calibration lab. Learn how UFM, DGX OS, Linux netdev names, and Cumulus NVUE each describe the same two-rail fabric before you start troubleshooting failures.
Labs
Each lab opens as a dedicated simulator session. Browse the available scenarios first, then enter the one you want to work on. FabricLab labs are best experienced on desktop because they combine topology, multi-device CLI, and reference tooling in one workspace.
14 labs in the catalog
Desktop recommended for simulator work
A healthy-link counter drill. Run a short RDMA write probe, then correlate switch-side PFC and RoCE counters with DGX ethtool counters so ECN marks, PFC pauses, and zero-drop behavior become familiar before incident labs.
A GPU rail has gone dark. Use the topology map and CLI tools to identify which rail, confirm the RDMA state, and isolate whether the fault is on the NIC or switch side.
A RoCEv2 workload is experiencing retransmissions because PFC is misconfigured. Diagnose the misconfiguration and fix it.
Throughput has dropped 40%. Investigate using interface counters and ECN configuration commands.
AllReduce throughput has fallen even though no packet drops are visible. Diagnose why spine links are unevenly loaded and fix load balancing to restore full training throughput.
Two vendors propose different switch configurations for a 64-node DGX H100 cluster. Calculate oversubscription ratios, identify which proposal meets requirements, and submit a recommendation before the purchase order is signed.
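The oversubscription arithmetic in this lab can be sketched as follows. This is a minimal illustration rather than the lab's own method: the port counts and speeds for the two proposals are hypothetical placeholders, and the ratio is taken per leaf switch as host-facing bandwidth divided by spine-facing bandwidth.

```python
from fractions import Fraction


def oversubscription(downlinks: int, downlink_gbps: int,
                     uplinks: int, uplink_gbps: int) -> Fraction:
    """Return host-facing bandwidth divided by spine-facing bandwidth for one leaf."""
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("a leaf needs at least one uplink with positive speed")
    return Fraction(downlinks * downlink_gbps, uplinks * uplink_gbps)


# Hypothetical vendor proposals (illustrative numbers, not taken from the lab).
proposal_a = oversubscription(downlinks=32, downlink_gbps=400,
                              uplinks=32, uplink_gbps=400)
proposal_b = oversubscription(downlinks=48, downlink_gbps=400,
                              uplinks=16, uplink_gbps=400)

for name, ratio in (("Proposal A", proposal_a), ("Proposal B", proposal_b)):
    verdict = "non-blocking" if ratio <= 1 else "oversubscribed"
    print(f"{name}: {ratio.numerator}:{ratio.denominator} ({verdict})")
```

With these placeholder numbers, Proposal A comes out at 1:1 and Proposal B at 3:1. Using `Fraction` keeps the ratio exact, so a 1:1 design is never misreported because of floating-point rounding.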
A 16-node cluster shows 3 GB/s busbw instead of expected performance. All hardware is healthy. Diagnose why NCCL is using socket transport and fix the environment variable misconfiguration.
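As a rough aid for this kind of diagnosis, the sketch below scans an environment for two NCCL settings that are known to push NCCL onto the socket transport: `NCCL_IB_DISABLE=1` and `NCCL_NET=Socket`. Which variable the lab actually misconfigures is not stated here, so treat the checks as assumptions; the helper name `socket_fallback_hints` is hypothetical.

```python
import os
from typing import Mapping


def socket_fallback_hints(env: Mapping[str, str]) -> list[str]:
    """List environment settings that would steer NCCL away from RDMA transport."""
    hints = []
    # NCCL_IB_DISABLE=1 turns off the IB/RoCE transport, leaving sockets.
    if env.get("NCCL_IB_DISABLE", "0") == "1":
        hints.append("NCCL_IB_DISABLE=1 disables the IB/RoCE transport")
    # NCCL_NET=Socket explicitly selects the socket network plugin.
    if env.get("NCCL_NET", "").lower() == "socket":
        hints.append("NCCL_NET=Socket forces the socket transport")
    return hints


if __name__ == "__main__":
    findings = socket_fallback_hints(os.environ)
    for finding in findings:
        print(finding)
    if not findings:
        print("No socket-forcing NCCL variables found in this environment")
```

Run it in the same shell or job environment that launches the training processes, since variables set in a launcher script may differ from those in an interactive session.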
Training is 12% slower, but there are no hard errors anywhere. Use UFM port counters, DCGM GPU metrics, and switch counters to correlate a rising pre-FEC BER across three monitoring layers and identify the marginal physical connector before it becomes a full link failure.
A switch looks healthy at a glance, but the NIC reveals a severe pause storm. Check both ends, identify missing ECN, and restore rate control before continuous PFC pauses collapse throughput.
PFC is enabled, but on the wrong traffic class. Cross-check NIC drops, PFC priority output, and RoCE DSCP-to-priority mapping, then move PFC protection back to priority 3.
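The cross-check described here amounts to asking whether every priority that carries RoCE traffic is also PFC-enabled. A minimal sketch of that comparison follows; the DSCP value 26 and the example mappings are assumptions chosen for illustration, and the function name is hypothetical. In the lab, the real values come from the switch and NIC output.

```python
def unprotected_roce_priorities(dscp_to_priority: dict[int, int],
                                roce_dscps: set[int],
                                pfc_enabled: set[int]) -> list[int]:
    """Return priorities that carry RoCE traffic but have no PFC protection.

    Raises KeyError if a RoCE DSCP value is missing from the mapping.
    """
    roce_priorities = {dscp_to_priority[dscp] for dscp in roce_dscps}
    return sorted(roce_priorities - pfc_enabled)


# Assumed example: RoCE marked DSCP 26 and mapped to priority 3.
mapping = {26: 3, 48: 6}

# Misconfigured case: PFC enabled on priority 4, so priority 3 is exposed.
print(unprotected_roce_priorities(mapping, {26}, pfc_enabled={4}))

# Corrected case: PFC moved back to priority 3, nothing is exposed.
print(unprotected_roce_priorities(mapping, {26}, pfc_enabled={3}))
```

An empty result means the PFC priority set covers the RoCE traffic class; any priority listed is lossy for RoCE and a likely source of NIC-side drops.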
A rail has gone err-disabled after a physical fault. Recognise the NIC-side active-state trap, confirm the switch port failure, replace the optic, clear err-disable, and verify full recovery.
A reduced-capacity spine is still receiving equal ECMP traffic. Identify the missing BGP Link Bandwidth community and restore weighted ECMP before PFC storms spread.
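To see why equal-cost splitting hurts a degraded spine, the sketch below compares per-spine utilization under equal ECMP with utilization under bandwidth-proportional weights, which is the behavior the BGP Link Bandwidth community is meant to enable. The spine names, capacities, and offered load are hypothetical, and the model ignores hashing imbalance.

```python
def ecmp_shares(capacity_gbps: dict[str, float], weighted: bool) -> dict[str, float]:
    """Return the fraction of traffic sent to each spine."""
    if not capacity_gbps:
        raise ValueError("at least one spine is required")
    if weighted:
        # Weighted ECMP: share is proportional to advertised bandwidth.
        total = sum(capacity_gbps.values())
        return {name: cap / total for name, cap in capacity_gbps.items()}
    # Plain ECMP: every next hop receives the same share.
    return {name: 1 / len(capacity_gbps) for name in capacity_gbps}


def utilization(capacity_gbps: dict[str, float], shares: dict[str, float],
                offered_gbps: float) -> dict[str, float]:
    """Return offered load on each spine as a fraction of its capacity."""
    return {name: offered_gbps * shares[name] / cap
            for name, cap in capacity_gbps.items()}


# Hypothetical fabric: spine4 has lost half of its uplink capacity.
spines = {"spine1": 1600.0, "spine2": 1600.0, "spine3": 1600.0, "spine4": 800.0}
offered = 4200.0

equal = utilization(spines, ecmp_shares(spines, weighted=False), offered)
weighted = utilization(spines, ecmp_shares(spines, weighted=True), offered)

for name in spines:
    print(f"{name}: equal={equal[name]:.0%} weighted={weighted[name]:.0%}")
```

In this example, equal ECMP pushes the reduced spine to roughly 131% of its capacity while the healthy spines sit near 66%; proportional weights bring every spine to 75%. Sustained load above 100% on one spine is the kind of congestion that triggers PFC pauses.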
A link failure triggers a bad 3-hop path because the spines use different ASNs. Trace the suboptimal route, unify the spine ASN design, and verify clean failover behavior.