How to Architect your System for More Efficient AI Model Training
In December of 2018, the first MLPerf benchmark results were submitted by Intel, Google and NVIDIA. The results measured the performance of different machine learning algorithms on the submitters' various hardware in terms of the time to train to an accuracy threshold. However, even with the wealth of information provided in these submissions, the official process says little about the systems beyond their primary compute resources.
With that in mind, I've been running the same benchmarks in our Micron Storage Customer lab to better understand how deep learning training stresses system resources. At Micron, we are pursuing memory and storage innovations to deliver dramatic improvements in moving, accessing and analyzing data. This post will provide an overview of my findings, and we'll publish additional results soon.
The system I used to run these tests is a SuperMicro SYS-4029GP-TVRT server. It's an 8-GPU server with GPU-compute performance nearly identical to the NVIDIA DGX-1. The main specifications are:
- 2x Intel Xeon Platinum 8180M CPUs
  - 28 cores @ 2.5 GHz
- 3TB Micron DRAM
  - 24x 128GB DDR4 2666 MHz DIMMs
- 8x NVIDIA V100 GPUs
  - SXM form factor
  - 32GB HBM2 memory per GPU
  - NVLink Cube Mesh interconnect between GPUs
- 8x 7.68TB Micron 9300 Pro NVMe drives in RAID-10 for datasets
The OS was Ubuntu 18.04.2 with all updates installed, and tests were executed using the Docker containers defined in NVIDIA's v0.5.0 submission code.
The first question I wanted to answer was whether the MLPerf submissions are reproducible. As MLPerf is still under heavy development, showing that results can be reproduced is an important step in the process.
The results of my own testing corroborate those that NVIDIA submitted. This isn't surprising or groundbreaking, but it is important.
Before we dive into the actual results, let me take a quick aside to describe one of my processes for benchmarking these applications. Due to the small size of the datasets available for benchmarking, I limited the amount of memory the containers had access to during training to ensure the full dataset didn't fit in the file system cache. If I had not limited the memory, the application would have read the dataset into memory during the first training epoch, then all subsequent epochs would fetch the data from the file system cache instead of doing any reads from disk.
And while this is an ideal way to run applications — have enough memory to fit your entire dataset — it's pretty unrealistic. The largest dataset used by the benchmarks is the ImageNet dataset for the image classification benchmark that clocks in at 136GB. Here at Micron, we have datasets for training that are several Terabytes in size, and we hear from our customers that they see the same thing: production datasets are quite a bit larger than the datasets used in these benchmarks.
While I can't change the datasets these benchmarks use, I can change how the system is configured to make the results more representative of what we see in the real world.
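As a concrete illustration, the memory cap can be applied at the Docker level. Here is a minimal sketch; the image name, mount path, launch script, and 64g figure are placeholders, not the exact values used in these tests:

```python
import subprocess

# Minimal sketch: cap the container's memory so the full dataset cannot sit in
# the file system cache. The image name, mount path, launch script, and 64g
# limit are illustrative placeholders, not the exact values from these tests.
cmd = [
    "docker", "run", "--rm",
    "--runtime=nvidia",        # NVIDIA container runtime (GPU access)
    "--memory=64g",            # hard cap on container RAM
    "--memory-swap=64g",       # equal to --memory, so no swap beyond the cap
    "-v", "/data/imagenet:/data",
    "mlperf-nvidia:image_classification",
    "./run_and_time.sh",
]
subprocess.run(cmd, check=True)
```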
Let's find out if limiting the container memory worked. With no limits placed on memory, I saw nearly negligible average disk throughput. However, once I limited each container to an amount of memory such that only a small portion of the dataset would fit in memory, I saw drastically different disk utilization (note that the scales of the vertical axes differ between these two charts):
For Image Classification, we see 61x higher disk utilization after limiting the memory available to the container. (This makes sense, as the training process runs for 62 epochs and the unlimited-memory configuration only needed to read from disk during the first epoch.)
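Disk utilization like this can be captured from standard OS counters; here is a minimal sketch with psutil (the md0 device name is an assumption for the RAID-10 volume, and this is not the exact collection method used for these charts):

```python
import time
import psutil

# Sketch: sample read throughput for one block device while training runs.
# "md0" is an assumption for the RAID-10 volume; substitute your device name.
DEVICE = "md0"

def read_throughput(interval_s: float = 5.0) -> float:
    """Return average read bytes/sec for DEVICE over the interval."""
    before = psutil.disk_io_counters(perdisk=True)[DEVICE].read_bytes
    time.sleep(interval_s)
    after = psutil.disk_io_counters(perdisk=True)[DEVICE].read_bytes
    return (after - before) / interval_s

for _ in range(12):  # one minute of 5-second samples
    print(f"{DEVICE}: {read_throughput() / 1e6:.0f} MB/s read")
```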
Okay, so we know that limiting the memory available to the container changes disk utilization, but what does it do to the performance of our training process? It turns out that, so long as you have fast enough storage, the impact on your application is negligible:
This result is a bigger deal than it initially seems. What we've shown is that, so long as your storage can keep up during the training process, you'll be able to get full utilization of your GPUs (dependent on your software stack) even if your dataset is too big to fit in local memory.
Now that we’ve covered storage, let’s look at the rest of the system:
The CPU utilization listed here is non-normalized. This means 500% would correspond — roughly — to 5 cores being 100% utilized (or 10 cores being 50% utilized).
Running these GPU-focused applications can still stress the CPUs in the training server. Depending on your application, this data might suggest you can skimp on your CPUs or it might show that you need to invest in top-end CPUs. Regardless, be sure to understand your workload to best architect your AI training servers.
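To make that metric concrete, here is a minimal sketch of deriving a non-normalized CPU figure from per-core samples with psutil (an illustration, not the collection method used for these charts):

```python
import psutil

# Sketch: a non-normalized CPU figure is just the sum of per-core utilization,
# so five fully busy cores report as roughly 500%.
per_core = psutil.cpu_percent(interval=5, percpu=True)
non_normalized = sum(per_core)
normalized = non_normalized / len(per_core)
print(f"{non_normalized:.0f}% non-normalized ({normalized:.1f}% of total capacity)")
```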
On the GPU utilization side, we see generally high GPU processor utilization, but the memory utilization can be fairly low, depending on the benchmark. I recommend taking the GPU Memory utilization data with a grain of salt, however. I expect that the memory utilization is largely impacted by the total dataset size. While I was able to limit the total memory available to the container, I was not able to limit the GPU memory.
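For reference, this kind of GPU data can be sampled through NVML; a minimal sketch using the pynvml bindings follows (an illustration, not the exact collection script used here):

```python
import pynvml  # provided by the nvidia-ml-py package

# Sketch: sample per-GPU compute and memory figures the same way nvidia-smi does.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy over the sampling window
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {util.gpu}% compute, {util.memory}% memory controller, "
          f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB allocated")
pynvml.nvmlShutdown()
```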
The parameters used in training for these algorithms are tuned for the fastest time to accuracy. One of the primary tunables — batch size — is highly dependent on available GPU memory as well as the total dataset size. This means that a small dataset may train faster with a small batch size that doesn’t fully utilize the GPU memory, and I expect we are seeing this effect here.
We’ve explored the system requirements for running AI applications in isolation on a single server, and, while the results are interesting and provide us with information we can use for architecting systems, this doesn’t really tell the whole story. During my testing I loaded all the MLPerf datasets (about 400GB total) to the local storage in the AI training server, then ran each training workload in sequence.
In production systems, however, we rarely have enough storage capacity in the training server to hold all training datasets. Local storage in an AI server is frequently used as a cache — a dataset will be loaded to the cache, a model will be trained, the dataset will be flushed from the cache. Additionally, for a sequence of models being trained, the dataset for the next model should be loaded while a previous model is training.
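Here is a minimal sketch of that cache pattern; the paths, dataset names, and the train() call are hypothetical placeholders:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor

# Sketch of the cache pattern described above: while one model trains on the
# local NVMe cache, the next dataset is copied in from shared storage.
# Paths, dataset names, and train() are hypothetical placeholders.
CACHE = "/raid/cache"
SHARED = "/mnt/shared/datasets"

def stage(name: str) -> str:
    dst = f"{CACHE}/{name}"
    shutil.copytree(f"{SHARED}/{name}", dst)   # data ingest (writes to the cache)
    return dst

def train(dataset_path: str) -> None:
    ...  # placeholder for launching the actual training job

datasets = ["imagenet", "coco", "wmt"]
with ThreadPoolExecutor(max_workers=1) as pool:
    staged = pool.submit(stage, datasets[0]).result()  # first dataset must land before training
    for nxt in datasets[1:]:
        prefetch = pool.submit(stage, nxt)  # ingest the next dataset in the background
        train(staged)                       # train on the current one
        shutil.rmtree(staged)               # flush the finished dataset from the cache
        staged = prefetch.result()
    train(staged)
    shutil.rmtree(staged)
```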
Let’s explore the effect on the benchmark result when running simultaneous data ingest. For the following data, I reduced the accuracy requirements on the MLPerf benchmarks so each training process would finish more quickly. This allowed more iterations of the tests with different data ingest parameters. Additionally, I’m going to compare the Micron 5200 Pro SATA SSD to the Micron 9300 Pro NVMe SSD (I used 8 of each drive in a RAID-10 configuration).
First, how does NVMe compare to SATA when we’re not running a simultaneous data ingest?
SATA and NVMe drives perform very similarly for AI training workloads when the workload is executed in isolation. This shouldn’t really be surprising: given the disk throughput numbers we saw above, it’s easy to see that 8x 5200 SATA drives can provide enough performance to handle our most intensive training workload (Image Classification at 1.2GB/s).
Now let’s find out how quickly we can ingest data while a model is training. For the following charts I used FIO to generate the file writes, using the ‘sync’ IO engine with multiple jobs.
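For reference, here is a minimal sketch of that kind of FIO write job, driven from Python; the target directory, job count, and sizes are illustrative rather than the exact parameters used for these charts:

```python
import subprocess

# Sketch of an FIO job that generates the ingest load: synchronous writes,
# multiple jobs, 128k blocks. Directory and sizes are illustrative.
cmd = [
    "fio",
    "--name=ingest",
    "--directory=/data/ingest",
    "--rw=write",
    "--ioengine=sync",
    "--bs=128k",
    "--numjobs=8",
    "--size=32G",
    "--group_reporting",
]
subprocess.run(cmd, check=True)
```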
This result is now significantly different; NVMe supports drastically higher data ingest rates while the AI model is being trained. This clearly shows that NVMe will support the parallel Data Ingest + Model Training process described above. However, we needed to verify that this heavy ingest didn’t impact the performance of the training process.
There are a few interesting things to take from this chart:
- Image Classification, while being the highest disk-utilization benchmark, was not sensitive to simultaneous data ingest on SATA or NVMe.
- The Single Stage Detector showed a big performance hit from the data ingest on SATA, on the order of 30%.
- Conversely, the Object Detection benchmark took a performance hit on NVMe, about 7%, while SATA maintained performance.
The reason for this behavior is currently unknown, but that didn’t stop us from exploring how to mitigate the performance loss. Two easy things to try were doing the ingest at a larger block size (1MB vs. 128k) and limiting the ingest throughput rate.
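As a rough illustration, both mitigations map to small changes in the FIO job from the earlier sketch; the specific block size and rate cap values here are illustrative:

```python
import subprocess

# Sketch of the two mitigations as variations on the ingest job above;
# the block sizes and rate cap are illustrative values.
base = ["fio", "--name=ingest", "--directory=/data/ingest",
        "--rw=write", "--ioengine=sync", "--numjobs=8", "--size=32G"]

# 1) Larger writes: 1MB blocks instead of 128k.
subprocess.run(base + ["--bs=1M"], check=True)

# 2) Throttled writes: keep 128k blocks but cap per-job write bandwidth.
subprocess.run(base + ["--bs=128k", "--rate=100m"], check=True)
```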
Let’s see how these options worked for the two cases mentioned here. For the Single Stage Detector on SATA we have the following results:
By increasing the block size of the data ingest, we could mitigate nearly all of the performance loss caused by the ingest.
Now the Object Detection benchmark:
Here the story is slightly different. We mitigated some of the performance loss by increasing block size but needed to reduce the ingest rate to get the full application performance.
So, what did we learn?
- Training AI models stresses not only the GPUs but also storage, system memory, and CPU resources. Be sure to take these into account when you’re architecting your AI systems.
- The Micron 9300 PRO NVMe SSD supports significantly higher data ingest rates than enterprise SATA SSDs, which helps enable the parallel data science pipeline.
- Heavy data ingest can have a negative impact on the time to train a model and is NOT simply correlated with the read throughput of training a model.
- Increasing the block size of data ingest or limiting the write throughput rate can mitigate the performance loss from doing simultaneous data ingest.
We’ll continue to explore the performance of AI and Data Science applications and how best to architect solutions for your workloads. We’ll post here as we learn more about this constantly changing ecosystem.