Data center power consumption for AI is a hot topic. As an example, OpenAI reported in its 2024 financials that its biggest operating expense was electricity.
Looking at artificial intelligence (AI) system architectures highlights what drives this high power use. The standard AI training system is an 8-GPU system with power requirements of up to 10 kW per system. There are also 4-GPU NVIDIA HGX platforms from various server vendors. For large-scale AI training, these power requirements are simply part of running a business that depends on extreme-scale GPU clusters. But what about smaller workloads?
AI everywhere
Enterprise AI is a rapidly emerging space, and it often requires on-prem AI inference because proprietary business data is deployed into AI models. Leading AI solutions take a base large language model (LLM) and fine-tune it on local enterprise data so it can understand all our unique corporate acronym salads. These models need to run locally when they have access to business-critical data.
NVIDIA MGX
NVIDIA launched the NVIDIA GH200 Grace Hopper Superchip, an engineered platform that combines an Arm-based CPU, LPDDR5X memory and an NVIDIA H100 GPU in a single package. It aims to supply the compute power of an NVIDIA GPU while optimizing efficiency for everything else in the system.
We recently tested a Supermicro ARS-111GL-SHR, a 1U NVIDIA GH200 system with a 72-core NVIDIA Grace CPU, 480GB of LPDDR5X and an H100 with 96GB of HBM3. The system also had an NVIDIA BlueField-3 DPU connected to four Micron 9550 NVMe E1.S SSDs.
By connecting NVMe through the BlueField-3, up to 8 NVMe SSDs can be deployed in a 1U NVIDIA MGX system, allowing considerable storage density per GPU.
This type of dense platform brings new requirements for deployment and system configuration. With the NVIDIA GH200, there can be two Superchip complexes in 1U, and future NVIDIA Blackwell-based systems will be able to fit four GPUs in a 1U enclosure.
Some important points to consider for these systems:
- Liquid cooling becomes a requirement for dense systems.
- EDSFF storage is a requirement. For 1U systems, E1.S form factor SSDs are optimal, while E3.S is more common for 2U systems.
- Storage performance density becomes important. Because there is not much physical space for storage, these systems will have a small number of maximum-performance SSDs, like the Micron 9550. This is driving a requirement for PCIe Gen6 storage.
How efficient is an NVIDIA MGX platform with the NVIDIA GH200, really?
To understand the efficiency delta between the NVIDIA GH200 and a standard system, we tested two servers using both NVIDIA GPUDirect Storage and the legacy IO path.
NVIDIA GPUDirect Storage creates a direct data path between the GPU and an NVMe SSD, while the control path still flows through the CPU and DRAM. Without GPUDirect Storage, all data flows through a bounce buffer in CPU DRAM, which is a significant bottleneck.
The specs of the two systems under test are:
- Intel + NVIDIA H100 GPU system: Supermicro SYS-521GU-TNXR with 2x Intel Xeon Platinum 8568Y+ (48 cores each), 512GB DDR5, an NVIDIA H100 NVL 96GB HBM3 GPU on PCIe Gen5 x16, and a Micron 9550 PRO SSD.
- NVIDIA GH200 Grace Hopper system: Supermicro ARS-111GL-SHR with an NVIDIA GH200 Grace Hopper Superchip (480GB LPDDR5X, H100 with 96GB HBM3) and an NVIDIA BlueField-3 with four PCIe Gen5 x4 MCIO connections to front E1.S bays with a Micron 9550 PRO SSD.
The Intel and NVIDIA system represents the way many enterprises deploy AI now, by sticking an NVIDIA H100 in a standard server. The NVIDIA GH200 offers a more efficient path to using the compute of an H100.
The workload tested is 4KB random reads using GDSIO with 256 workers and 40GB files. Using the legacy path, the NVIDIA GH200 is four times more efficient in MB/s per watt than the Intel system. Enabling NVIDIA GPUDirect Storage increases efficiency 10 times on the Intel system and four times on the NVIDIA GH200. Overall, the NVIDIA GH200 is 60% more energy efficient than the Intel system when using GDS.
Looking at average system power draw, the Intel system used 900 W while the NVIDIA GH200 used 350 W.
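These relative figures are internally consistent, as a quick sanity check shows. The multipliers below are the ones quoted above, not raw measurements:

```python
# Sanity check of the relative efficiency figures (MB/s per watt) quoted above.
legacy_advantage = 4.0   # GH200 vs. Intel system, legacy IO path
intel_gds_gain = 10.0    # Intel system's efficiency gain from GPUDirect Storage
gh200_gds_gain = 4.0     # GH200's efficiency gain from GPUDirect Storage

# With GDS enabled on both systems, the GH200's remaining advantage is:
gds_advantage = legacy_advantage * gh200_gds_gain / intel_gds_gain
print(f"GH200 vs. Intel with GDS: {gds_advantage:.1f}x")  # 1.6x, i.e. 60% more efficient
```

The larger GDS gain on the Intel system reflects how much the CPU/DRAM bounce buffer was costing it on the legacy path; the GH200 starts from a more efficient baseline, so its gain is smaller.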
The efficient path to enterprise AI workloads
The MGX line of systems with the NVIDIA GH200 Superchip represents a power-efficient way to leverage the unique compute capabilities of an NVIDIA GPU. From a Micron component standpoint, we make the LPDDR5X, the HBM3E in the GPU, and the Micron 9550 NVMe SSD in the E1.S and E3.S form factors that are optimal for this platform.
The NVIDIA GH200 systems are shipping now from Supermicro, HPE and others.
- Supermicro MGX Grace Hopper systems (the ARS-111GL-SHR was used in this testing)
- HPE ProLiant DL384