Micron worked with teams at Dell and NVIDIA to produce industry-leading research on AI training model offload to NVMe, which it showcased at the NVIDIA GTC global AI conference. Micron’s Data Center Workload Engineering team, with the support of Dell’s Technical Marketing Lab and NVIDIA’s storage software development team, tested Big Accelerator Memory (BaM) with GPU-initiated direct storage (GIDS) on the NVIDIA H100 Tensor Core GPU in a Dell PowerEdge R7625 server with Micron’s upcoming high-performance Gen5 E3.S NVMe SSD.
BaM and GIDS are research projects based on the following paper, with open-source code available on GitHub:
- GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture: https://arxiv.org/abs/2203.04910
- GitHub: https://github.com/ZaidQureshi/bam
NVMe as More Memory?
AI model sizes are growing rapidly, and the default approach to training large models is to fit as much as possible in GPU HBM, fall back to system DRAM next, and, if the model still doesn't fit in HBM plus DRAM, parallelize the job across multiple NVIDIA GPU systems.
There is a heavy cost to parallelizing training over multiple servers, specifically in GPU utilization and efficiency, as data needs to flow over network and system links, which can easily become bottlenecks.
What if we could avoid having to split an AI training job over multiple GPU systems by using NVMe as a third tier of “slow” memory? That’s exactly what BaM with GIDS does. It replaces and streamlines the NVMe driver, handing the data and control paths to the GPU. So how does that perform?
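Conceptually, the GPU treats the SSD as one more level in the memory hierarchy. The sketch below is illustrative only and uses hypothetical names; it is not the BaM or GIDS API, just the three-tier lookup idea expressed in plain Python.

```python
# Purely illustrative tiering logic, NOT the BaM or GIDS API. The names here are
# hypothetical placeholders for HBM, DRAM and NVMe acting as three successively
# larger and slower memory tiers.

def get_features(node_id, hbm_cache, dram_cache, nvme_reader):
    """Return the feature vector for node_id from the fastest tier that holds it."""
    if node_id in hbm_cache:       # tier 1: GPU HBM (fastest, smallest)
        return hbm_cache[node_id]
    if node_id in dram_cache:      # tier 2: host DRAM
        return dram_cache[node_id]
    # Tier 3: NVMe SSD. With BaM and GIDS, the GPU itself issues this read
    # instead of faulting back to the CPU.
    return nvme_reader.read(node_id)
```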
Baseline Performance Results
All test results shown were run with the BaM Graph Neural Network (GNN) benchmark included with the open-source BaM implementation linked above.
The first test compares the benchmark with and without BaM with GIDS enabled. For the baseline without specialized storage software, standard Linux mmap was used, so every access to offloaded data faults through the CPU and the kernel page cache down to storage.
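As a point of reference, here is a minimal sketch of that baseline approach, assuming node features are stored as a flat float32 array in a single file (the file name and feature dimension below are hypothetical): `numpy.memmap` maps the file into host memory, and the first touch of each page triggers a CPU page fault that the kernel services with a read from the SSD.

```python
import numpy as np

# Hypothetical layout: one float32 feature vector of length FEAT_DIM per node,
# stored back to back in features.bin.
FEAT_DIM = 1024
features = np.memmap("features.bin", dtype=np.float32, mode="r").reshape(-1, FEAT_DIM)

def gather(node_ids):
    # Every touch of a not-yet-resident page faults through the CPU and the
    # kernel page cache down to the SSD; this is the path that BaM with GIDS
    # bypasses by letting the GPU issue the reads directly.
    return features[node_ids]
```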
The mmap test took 19 minutes on an NVIDIA A100 80GB Tensor Core GPU and a Micron 9400 Gen4 NVMe SSD. With BaM and GIDS deployed, it took 42 seconds, a 26x improvement in performance. That performance improvement is seen in the feature aggregation component of the benchmark, which is dependent on storage performance.
Gen5 Performance in Dell Labs
At GTC, Micron wanted to prove that our upcoming Gen5 NVMe SSD worked well for AI model offload. We partnered with Dell’s Technical Marketing Labs to get access to a Dell PowerEdge R7625 server with an NVIDIA H100 80GB PCIe GPU (Gen5x16) and completed testing with their excellent support.
| GNN Workload Performance | Micron Gen5 H100 | Micron Gen4 A100 | Gen5 vs Gen4 Performance |
| --- | --- | --- | --- |
| Feature Aggregation (NVMe) | 18s | 35s | 2x |
| Training (GPU) | 0.73s | 3.6s | 5x |
| Sampling | 3s | 4.6s | 1.5x |
| End-to-End time (Total of Feature Aggregation + Training + Sampling) | 22.4s | 43.2s | 2x |
| GIDS + BaM Accesses/s | 2.87M | 1.5M | 2x |
Feature aggregation depends on SSD performance. It accounts for roughly 80% of total runtime, and we see a 2x improvement from Gen4 to Gen5. Sampling and training are GPU-dependent, and we see a 5x training performance improvement moving from an NVIDIA A100 to an H100 Tensor Core GPU. This use case requires high-performance Gen5 SSDs, and a pre-production version of Micron's Gen5 SSD delivers nearly double the performance of Gen4.
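A quick sanity check of those figures, using only the numbers reported in the table above:

```python
# Sanity check using the Gen5 and Gen4 numbers from the table above.
feature_agg_gen5 = 18.0    # seconds, storage-bound feature aggregation on Gen5
end_to_end_gen5 = 22.4     # seconds
end_to_end_gen4 = 43.2     # seconds

share = feature_agg_gen5 / end_to_end_gen5    # ~0.80: feature aggregation is ~80% of runtime
speedup = end_to_end_gen4 / end_to_end_gen5   # ~1.9x: the end-to-end "2x" improvement

print(f"Feature aggregation share of runtime: {share:.0%}")
print(f"End-to-end Gen4 to Gen5 speedup: {speedup:.1f}x")
```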
What Is BaM With GIDS Doing to Our SSD?
Since BaM with GIDS replaces the NVMe driver, standard Linux tools for viewing IO metrics (IOPS, latency, etc.) do not work. We took a trace of the BaM with GIDS GNN training workload and found some startling results.
- BaM with GIDS executes at nearly the max IO performance of the drive.
- The IO profile for GNN training is 99% small block reads.
- The SSD queue depth is 10-100x what we expect from a “normal” data center workload on CPU.
This is a novel workload that pushes the top end of NVMe performance. A GPU can keep a very large number of IO requests in flight in parallel, and the BaM with GIDS software manages and optimizes them for latency, creating a workload profile that may not even be possible to generate from a CPU.
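To put the accesses-per-second figure in context, here is a rough estimate of the implied read throughput. The 4KB transfer size is an assumption made for illustration (the trace only tells us the reads are small-block); the actual transfer size is not reported above.

```python
# Rough estimate of implied SSD read bandwidth from the access rate in the table.
accesses_per_second = 2.87e6    # GIDS + BaM accesses/s on the Gen5 system
assumed_block_bytes = 4096      # ASSUMPTION: 4KB small-block reads

implied_gb_per_s = accesses_per_second * assumed_block_bytes / 1e9
print(f"Implied read throughput: ~{implied_gb_per_s:.1f} GB/s")   # ~11.8 GB/s if reads are 4KB
```

Under that assumption, roughly 12 GB/s of small-block reads is consistent with the observation above that BaM with GIDS runs the drive near its maximum IO performance.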
Conclusion
As the AI industry advances, intelligent solutions for GPU system utilization and efficiency are incredibly important. Software like BaM with GIDS makes more efficient use of AI system resources by providing a practical way to tackle larger AI problem sets on a given system. Extending model storage to NVMe does add training time, but this trade-off allows less time-sensitive large training jobs to run on fewer GPU systems, ultimately improving the efficiency and TCO of deployed AI hardware.
This data was used in the following NVIDIA GTC session:
Accelerating and Securing GPU Accesses to Large Datasets [S62559]
Huge thanks to the following folks at Micron, Dell, and NVIDIA for making this research possible:
- Micron: John Mazzie, Jeff Armstrong
- Dell: Seamus Jones, Jeremy Johnson, Mohan Rokkam
- NVIDIA: Vikram Sharma Mailthody, CJ Newburn, Brian Park, Zaid Qureshi, Wen-Mei Hwu
Hardware and Software Details:
- Workload: GIDS with IGBH-Full Training.
- NVMe performance results measured by Micron’s Data Center Workload Engineering team, baseline (mmap) performance results measured by NVIDIA’s storage software team on a similar system.
- Systems under test:
- Gen4: 2x AMD EPYC 7713, 64-core, 1TB DDR4, Micron 9400 PRO 8TB, NVIDIA A100-80GB GPU, Ubuntu 20.04 LTS (5.4.0-144), NVIDIA Driver 535.129.03, CUDA 12.3, DGL 2.0.0
- Gen5: Dell R7625, 2x AMD EPYC 9274F, 24-core, 1TB DDR5, Micron Gen5 SSD, NVIDIA H100-80GB GPU, Ubuntu 20.04 LTS (5.4.0-144), NVIDIA Driver 535.129.03, CUDA 12.3, DGL 2.0.0
- Work based on paper “GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture” https://arxiv.org/abs/2203.04910, https://github.com/ZaidQureshi/bam