Labs

Pick a FabricLab scenario before entering the simulator

Each lab opens as a dedicated simulator session. Browse the available scenarios first, then enter the one you want to work on. FabricLab labs are best experienced on desktop because they combine topology, multi-device CLI, and reference tooling in one workspace.

14 labs in the catalog

Desktop recommended for simulator work

Guest: sign in required to enter

Lab 0A · 12 min · Desktop recommended

Fabric CLI orientation

A no-incident calibration lab. Learn how UFM, DGX OS, Linux netdev names, and Cumulus NVUE each describe the same two-rail fabric before you start troubleshooting failures.

CLI, UFM, DGX OS, Cumulus, Rail mapping

Lab 0B · 15 min · Desktop recommended

Read RoCE lossless counters

A healthy-link counter drill. Run a short RDMA write probe, then correlate switch-side PFC and RoCE counters with DGX ethtool counters so ECN marks, PFC pauses, and zero-drop behavior become familiar before incident labs.

PFC, ECN, RoCE counters, ib_write_bw

Lab 0 · 12 min · Desktop recommended

Identify the failed rail

A GPU rail has gone dark. Use the topology map and CLI tools to identify which rail, confirm the RDMA state, and isolate whether the fault is on the NIC or switch side.

CLI, Physical diagnostics, Rail state

Lab 1 · 10 min · Desktop recommended

Fix the PFC misconfiguration

A RoCEv2 workload is experiencing retransmissions. PFC is misconfigured. Diagnose and fix it.

PFC, RoCEv2, CLI

Lab 2 · 15 min · Desktop recommended

Diagnose fabric congestion

Throughput has dropped 40%. Investigate using interface counters and ECN configuration commands.

Congestion, ECN, Counters

Lab 3 · 15 min · Desktop recommended

Diagnose uneven spine utilisation

AllReduce throughput has fallen even though no packet drops are visible. Diagnose why the spine links are unevenly loaded, then fix load balancing to restore full training throughput.

ECMP, Load balancing, Spine counters

Lab 4 · 15 min · Desktop recommended

Evaluate topology proposals

Two vendors propose different switch configurations for a 64-node DGX H100 cluster. Calculate oversubscription ratios, identify which proposal meets requirements, and submit a recommendation before the purchase order is signed.

Oversubscription, Capacity planning, Topology
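
The oversubscription calculation in Lab 4 can be sketched as follows. This is a minimal illustration, assuming the ratio is taken per leaf switch as total downlink capacity divided by total uplink capacity. The port counts, speeds, proposal names, and the 1:1 requirement are hypothetical placeholders, not the values used in the lab.

```python
from fractions import Fraction


def oversubscription_ratio(downlink_ports, downlink_gbps, uplink_ports, uplink_gbps):
    """Return leaf downlink capacity divided by uplink capacity as an exact ratio."""
    downlink_capacity = downlink_ports * downlink_gbps
    uplink_capacity = uplink_ports * uplink_gbps
    if uplink_capacity <= 0:
        raise ValueError("uplink capacity must be positive")
    return Fraction(downlink_capacity, uplink_capacity)


# Hypothetical per-leaf port layouts; these are not the lab's actual proposals.
proposals = {
    "proposal_a": {"downlink_ports": 32, "downlink_gbps": 400,
                   "uplink_ports": 32, "uplink_gbps": 400},
    "proposal_b": {"downlink_ports": 48, "downlink_gbps": 400,
                   "uplink_ports": 16, "uplink_gbps": 400},
}

# Assumed requirement: non-blocking, i.e. a ratio of at most 1:1.
max_allowed = Fraction(1, 1)

ratios = {name: oversubscription_ratio(**ports) for name, ports in proposals.items()}
for name, ratio in ratios.items():
    verdict = "meets" if ratio <= max_allowed else "exceeds"
    print(f"{name}: {ratio.numerator}:{ratio.denominator} ({verdict} the assumed 1:1 limit)")
```

Using `Fraction` keeps the result exact, so a 3:1 design is reported as 3:1 rather than as a rounded float.
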
Lab 5 · 20 min · Desktop recommended

Diagnose NCCL transport fallback

A 16-node cluster shows 3 GB/s busbw instead of the expected performance, even though all hardware is healthy. Diagnose why NCCL is using the socket transport and fix the environment variable misconfiguration.

NCCL, Transport fallback, Env vars
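
As a rough illustration of the kind of check Lab 5 involves, the sketch below scans a job environment for two documented NCCL settings that push traffic onto the socket transport: `NCCL_IB_DISABLE=1` and `NCCL_NET=Socket`. Which variable the lab actually misconfigures is not stated here, so the helper name `socket_fallback_reasons` and the sample environment are assumptions.

```python
def socket_fallback_reasons(env):
    """Return human-readable reasons why this environment would force NCCL sockets."""
    reasons = []
    # NCCL_IB_DISABLE=1 turns off the IB/RoCE transport entirely.
    if env.get("NCCL_IB_DISABLE", "0").strip() == "1":
        reasons.append("NCCL_IB_DISABLE=1 turns off the IB/RoCE transport")
    # NCCL_NET=Socket pins NCCL to the socket network plugin.
    if env.get("NCCL_NET", "").strip().lower() == "socket":
        reasons.append("NCCL_NET=Socket forces the socket transport")
    return reasons


# Hypothetical job environment, for illustration only.
job_env = {"NCCL_IB_DISABLE": "1", "NCCL_DEBUG": "INFO"}

reasons = socket_fallback_reasons(job_env)
for reason in reasons:
    print(reason)
```

On a real node the same function could be called with `dict(os.environ)` from inside the job's launch environment.
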
Lab 6 · 25 min · Desktop recommended

Triage a silent fabric degradation

Training is 12% slower, but there are no hard errors anywhere. Use UFM port counters, DCGM GPU metrics, and switch counters to correlate a rising pre-FEC BER across three monitoring layers and identify the marginal physical connector before it becomes a full link failure.

Monitoring, DCGM, UFM, Telemetry

Lab 7 · 15 min · Desktop recommended

Uncover the hidden pause storm

A switch looks healthy at a glance, but the NIC reveals a severe pause storm. Check both ends, identify missing ECN, and restore rate control before continuous PFC pauses collapse throughput.

PFC, Pause storm, ECN

Lab 8 · 18 min · Desktop recommended

Fix the PFC priority mismatch

PFC is enabled, but on the wrong traffic class. Cross-check NIC drops, PFC priority output, and RoCE DSCP-to-priority mapping, then move PFC protection back to priority 3.

PFC, Priority mapping, RoCE
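
The cross-check in Lab 8 comes down to one question: does the priority that RoCE traffic maps to have PFC enabled? A minimal sketch of that logic follows. The DSCP value 26 and the mapping table are hypothetical, and only priority 3 comes from the lab description.

```python
def pfc_protects_roce(roce_dscp, dscp_to_priority, pfc_enabled_priorities):
    """Return (priority RoCE maps to, whether PFC is enabled on that priority)."""
    priority = dscp_to_priority.get(roce_dscp)
    return priority, priority in pfc_enabled_priorities


# Hypothetical DSCP-to-priority table; only priority 3 is taken from the lab text.
dscp_to_priority = {26: 3, 48: 6}

# Misconfigured state: PFC is enabled, but on a different priority.
broken = pfc_protects_roce(26, dscp_to_priority, {4})
# Repaired state: PFC protection moved back to priority 3.
fixed = pfc_protects_roce(26, dscp_to_priority, {3})

print("before fix:", broken)
print("after fix:", fixed)
```
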
Lab 9 · 20 min · Desktop recommended

Recover the err-disabled rail

A rail has gone err-disabled after a physical fault. Recognise the NIC-side active-state trap, confirm the switch port failure, replace the optic, clear err-disable, and verify full recovery.

Optics, Err-disable, Rail recovery

Lab 10 · 22 min · Desktop recommended

ECMP hotspot: BGP bandwidth community

A reduced-capacity spine is still receiving equal ECMP traffic. Identify the missing BGP Link Bandwidth community and restore weighted ECMP before PFC storms spread.

BGP, ECMP, Weighted paths
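
To see why equal ECMP hurts a reduced-capacity spine, the sketch below compares an equal split with a split weighted by link bandwidth, which is the effect weighted ECMP aims for. It models only the resulting traffic shares, not BGP itself. The spine names, link speeds, and offered load are hypothetical.

```python
from fractions import Fraction


def ecmp_shares(link_gbps, weighted):
    """Return each next hop's share of traffic, equal or proportional to bandwidth."""
    if weighted:
        total = sum(link_gbps.values())
        return {hop: Fraction(bw, total) for hop, bw in link_gbps.items()}
    return {hop: Fraction(1, len(link_gbps)) for hop in link_gbps}


# Hypothetical fabric: spine3 is running at reduced capacity.
spine_gbps = {"spine1": 400, "spine2": 400, "spine3": 200}
offered_gbps = 900  # hypothetical aggregate load from one leaf

for weighted in (False, True):
    label = "weighted" if weighted else "equal"
    for hop, share in ecmp_shares(spine_gbps, weighted).items():
        load = share * offered_gbps
        status = "OVERLOADED" if load > spine_gbps[hop] else "ok"
        print(f"{label:8} {hop}: {float(load):.0f} of {spine_gbps[hop]} Gb/s ({status})")
```

With these numbers, the equal split sends more traffic to the reduced spine than it can carry, while the weighted split keeps every spine within its capacity.
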
Lab 11 · 24 min · Desktop recommended

BGP suboptimal routing: spine ASN design

A link failure triggers a bad 3-hop path because the spines use different ASNs. Trace the suboptimal route, unify the spine ASN design, and verify clean failover behavior.

BGP, Failover, ASN design