Training datasets continue to grow, and models now reach into the billions of parameters. While some datasets fit completely in system memory, larger ones cannot, so data loaders must reach data that resides on flash storage. One common method is a memory-mapped file stored on SSDs, which lets the data loader access the file as if it were in memory; however, the overhead of the CPU and software stack drastically reduces the performance of the training system. This is where Big accelerator Memory (BaM)* and the GPU-Initiated Direct Storage (GIDS)* data loader come in.
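To make the memory-map approach concrete, here is a minimal sketch (not tied to any specific framework; the file layout and record size are invented for illustration). It maps a small binary feature file and reads one record through the mapping; the operating system pages data in from storage on demand, so the file behaves like an in-memory array:

```python
import mmap
import os
import struct
import tempfile

RECORD_SIZE = 8  # one float64 "feature" per record (illustrative layout)

# Write a small binary feature file to disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(1000):
        f.write(struct.pack("d", float(i)))
    path = f.name

# Map the file and read one record; the access triggers a page fault
# serviced from storage rather than an explicit read() call.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    offset = 742 * RECORD_SIZE
    (value,) = struct.unpack_from("d", mm, offset)
    mm.close()

os.unlink(path)
print(value)  # 742.0
```

Every such page fault travels through the CPU and kernel software stack, which is exactly the overhead that BaM and GIDS are designed to remove.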
What are BaM and GIDS?
BaM is a system architecture that exploits the low latency, extremely high throughput, large capacity, and endurance of modern SSDs. BaM's goal is to provide efficient abstractions that let GPU threads make fine-grained accesses to datasets on SSDs, achieving much higher performance than solutions that rely on the CPU to issue storage requests on the GPU's behalf. BaM uses a custom storage driver designed specifically to let the inherent parallelism of the GPU drive storage devices directly. BaM differs from NVIDIA Magnum IO™ GPUDirect® Storage (GDS) in that it does not rely on the CPU to set up the communication from GPU to SSD.
Micron has previously published work with NVIDIA GDS, as noted below:
- The Micron® 9400 NVMe™ SSD Performance With NVIDIA Magnum IO GPUDirect Storage Platform
- Micron Collaboration With Magnum IO GPUDirect Storage Brings Industry-Disruptive Innovation to AI & ML
The GIDS dataloader is built on the BaM subsystem to address the memory capacity requirements of GPU-accelerated Graph Neural Network (GNN) training while also masking storage latency. GIDS does this by storing the graph's feature data on the SSD, since this data is typically the largest part of the total dataset for large-scale graphs. The graph structure data, which is typically much smaller than the feature data, is pinned in system memory to enable rapid GPU graph sampling. Lastly, the GIDS dataloader allocates a software-defined cache in GPU memory for recently accessed nodes in order to reduce storage accesses.
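The recently-accessed-node cache can be illustrated with a simple host-side LRU structure. This is a sketch only: the real GIDS cache lives in GPU memory and is managed by the GIDS runtime, and the `FeatureCache` class and its names here are hypothetical.

```python
from collections import OrderedDict

class FeatureCache:
    """Minimal LRU cache for node feature vectors (illustrative only;
    here a plain dict stands in for feature data on the SSD)."""

    def __init__(self, backing_store, capacity):
        self.backing = backing_store  # stand-in for feature data on SSD
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, node_id):
        if node_id in self.cache:
            self.cache.move_to_end(node_id)  # mark as recently used
            self.hits += 1
            return self.cache[node_id]
        self.misses += 1                     # would be a storage access
        features = self.backing[node_id]
        self.cache[node_id] = features
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used
        return features

# Hot nodes that recur across sampled mini-batches are served from the
# cache, cutting the number of storage requests.
ssd = {i: [float(i)] * 4 for i in range(100)}
cache = FeatureCache(ssd, capacity=8)
for node in [1, 2, 3, 1, 2, 3, 1, 2, 3]:
    cache.fetch(node)
print(cache.hits, cache.misses)  # 6 3
```

Only the first access to each node touches the backing store; repeats are cache hits, which is the effect GIDS relies on to reduce SSD traffic for frequently sampled nodes.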
Graph neural network training using GIDS
To show the benefits of BaM and GIDS, we performed GNN training using the Illinois Graph Benchmark (IGB) heterogeneous full dataset. At 2.28 TB, this dataset would not fit in most platforms' system memory. We timed training for 100 iterations using a single NVIDIA A100 80GB Tensor Core GPU and varied the number of SSDs to provide a broad range of results, as seen in Figure 1 and Table 1.
Figure 1: GIDS Training Time for IGB-Heterogeneous Full Dataset - 100 Iterations
| Time (seconds) | GIDS (4 SSDs) | GIDS (2 SSDs) | GIDS (1 SSD) | DGL Memory Map Abstraction |
|---|---|---|---|---|
| Sampling | 4.75 | 4.93 | 4.08 | 4.65 |
| Feature Aggregation | 8.57 | 15.9 | 31.6 | 1,130 |
| Training | 1.98 | 1.97 | 1.87 | 2.13 |
| End-to-End | 15.3 | 22.8 | 37.6 | 1,143 |
Table 1: GIDS Training Time for IGB-Heterogeneous Full Dataset - 100 Iterations
The first part of training is graph sampling, performed by the GPU against the graph structure data held in system memory (shown in blue). This value varies little across the test configurations because the structure data in system memory is identical in every test.
Another part is the actual training time (shown at the far right in green). This part is highly dependent on the GPU and, as expected, changes little across the test configurations.
The most important section, where we see the largest difference, is feature aggregation (shown in gold). Because the feature data is stored on the Micron 9400 SSDs in this system, scaling from 1 to 4 Micron 9400 SSDs drastically reduces the feature aggregation time: a 3.68x improvement from 1 SSD to 4 SSDs.
We also included a baseline measurement, which uses a memory-map abstraction and the Deep Graph Library (DGL) data loader to access the feature data. Because this method goes through the CPU software stack instead of giving the GPU direct access, it shows how inefficient that stack is at keeping the GPU saturated during training. The feature aggregation improvement over the baseline is 35.76x for 1 Micron 9400 NVMe SSD using GIDS and 131.87x for 4 Micron 9400 NVMe SSDs. Another view of this data appears in Figure 2 and Table 2, which show the effective bandwidth and IOPS during these tests.
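These speedups follow directly from the feature-aggregation row of Table 1. The short check below recomputes them; note that the table values are rounded, so the ratios can differ from the quoted figures in the last digit:

```python
# Feature-aggregation times (seconds per 100 iterations), read from Table 1.
baseline_dgl = 1130.0
gids = {1: 31.6, 2: 15.9, 4: 8.57}

scale_1_to_4 = gids[1] / gids[4]      # SSD scaling within GIDS
vs_baseline_1 = baseline_dgl / gids[1]  # GIDS (1 SSD) vs DGL mmap baseline
vs_baseline_4 = baseline_dgl / gids[4]  # GIDS (4 SSDs) vs DGL mmap baseline

print(f"1 -> 4 SSD feature aggregation speedup: {scale_1_to_4:.2f}x")
print(f"GIDS (1 SSD) vs baseline:  {vs_baseline_1:.2f}x")
print(f"GIDS (4 SSDs) vs baseline: {vs_baseline_4:.2f}x")
```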
Figure 2: Effective Bandwidth and IOPS of GIDS Training vs Baseline
| | DGL Memory Map | GIDS (1 SSD) | GIDS (2 SSDs) | GIDS (4 SSDs) |
|---|---|---|---|---|
| Effective Bandwidth (GB/s) | 0.194 | 6.9 | 13.8 | 25.6 |
| Achieved IOPS (M/s) | 0.049 | 1.7 | 3.4 | 6.3 |
Table 2: Effective Bandwidth and IOPS of GIDS Training vs Baseline
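Table 2 also lets us gauge how efficiently GIDS scales across drives. The sketch below computes per-drive bandwidth and scaling efficiency from the (rounded) table values:

```python
# Effective bandwidth (GB/s) by SSD count, read from Table 2.
bw = {1: 6.9, 2: 13.8, 4: 25.6}

for n in (1, 2, 4):
    per_drive = bw[n] / n               # bandwidth delivered per SSD
    efficiency = bw[n] / (n * bw[1])    # fraction of perfectly linear scaling
    print(f"{n} SSD(s): {per_drive:.2f} GB/s per drive, "
          f"{efficiency:.0%} of linear scaling")
```

Scaling stays close to linear through 4 drives, which suggests the GPU-initiated path, rather than the SSDs, sets the pace as drives are added.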
As datasets continue to grow, a paradigm shift is needed to train these models in a reasonable amount of time and to take advantage of the improvements offered by leading GPUs. BaM and GIDS are a great starting point, and we look forward to working with more systems like these in the future.
Test System
| Component | Details |
|---|---|
| Server | Supermicro® AS 4124GS-TNR |
| CPU | |
| Memory | 1 TB Micron DDR4-3200 |
| GPU | NVIDIA A100 80GB (Memory Clock: 1512 MHz, SM Clock: 1410 MHz) |
| SSDs | 4x Micron 9400 MAX 6.4TB |
| OS | Ubuntu 20.04, Kernel 5.4.0 |
| NVIDIA Driver | 535.113.01 |
Reference Links
Big Accelerator Memory paper and GitHub:
- GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture (arxiv.org)
- GitHub - ZaidQureshi/bam

GPU Initiated Direct Storage paper and GitHub:
- Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses (arxiv.org)
- GitHub - jeongminpark417/GIDS

IGB dataset:
- GitHub - IllinoisGraphBenchmark/IGB-Datasets: Largest real-world open-source graph dataset - Work done under the IBM-Illinois Discovery Accelerator Institute and Amazon Research Awards, in collaboration with NVIDIA Research.
*Note: NVIDIA Big Accelerator Memory (BaM) and NVIDIA GPU Initiated Direct Storage (GIDS) dataloader are prototype projects from NVIDIA Research and are not intended for general release.