
Beat the GPU Storage Bottleneck for AI and ML 

Data centers that support AI and ML deployments rely on Graphics Processing Unit (GPU)-based servers to power their computationally intensive architectures. Across multiple industries, expanding GPU use is driving a projected CAGR of more than 31 percent for GPU servers through 2024. That means more system architects will be tasked with ensuring top performance and cost efficiency from GPU systems.

Yet optimizing storage for these GPU-based AI/ML workloads is no small feat. Storage systems must process massive data volumes at high speed while tackling two challenges:

1) Server utilization. GPU servers are highly efficient at the matrix multiplication and convolution operations required to train AI/ML models on large datasets. However, GPU servers cost roughly three times as much as typical CPU servers, so to maintain ROI, IT staff need to keep the GPUs busy. Unfortunately, extensive deployment experience has shown that GPUs are often utilized at only around 30 percent of capacity; a rough illustration of what that idle time costs appears after this list.

2) The GPU storage bottleneck. ML training datasets typically far exceed a GPU's local RAM capacity, creating an I/O chokepoint that analysts call the GPU storage bottleneck. AI and ML systems end up waiting, and waiting, to access storage resources, because the sheer size of the datasets impedes timely access and thus performance.
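To put the utilization problem from point 1 in concrete terms, here is a rough, illustrative calculation of how idle GPU cycles inflate effective cost. The only inputs are the assumptions already stated above (a roughly 3x price premium for GPU servers and about 30 percent utilization); real figures will vary by deployment.

```python
# Back-of-the-envelope: effective cost of useful GPU compute at a given utilization.
# All inputs are illustrative assumptions restated from the text, not vendor pricing.

CPU_SERVER_COST = 1.0                      # normalized cost of a typical CPU server
GPU_SERVER_COST = 3.0 * CPU_SERVER_COST    # GPU servers cost roughly 3x as much

def effective_cost(utilization: float) -> float:
    """Cost per unit of useful GPU time: idle cycles make each busy hour pricier."""
    return GPU_SERVER_COST / utilization

for utilization in (0.30, 0.60, 0.90):
    print(f"utilization {utilization:.0%}: "
          f"effective cost {effective_cost(utilization):.1f}x a CPU server")

# At 30% utilization, each unit of useful GPU work effectively costs 10x a CPU server;
# pushing utilization toward 90% brings that down to roughly 3.3x.
```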

To address this, NVMe flash SSDs have gradually displaced standard SATA flash SSDs as the go-to choice for AI/ML storage. NVMe enables massive I/O parallelism, delivering about 6x the performance of comparable SATA SSDs and up to 10x lower latency, with better power efficiency. Just as GPUs have advanced high-performance computing, NVMe flash has enabled greater storage performance, bandwidth and IOPS while reducing latency. NVMe flash solutions can load AI and ML datasets to the application several times faster and avoid starving GPUs.

In addition, NVMe over Fabrics (NVMe-oF), which virtualizes NVMe resources across a high-speed network, has enabled storage architectures that are particularly well suited to AI and ML. NVMe-oF provides GPUs with direct access to an elastic pool of NVMe, so all resources can be accessed with local flash performance. It enables AI data scientists and HPC researchers to feed far more data to their applications and reach better results faster.
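As a minimal illustration of that "local performance" point: once an NVMe-oF namespace is connected, it appears as an ordinary block device and can be read just like local flash. The sketch below is illustrative only; it assumes a Linux host, an already-connected namespace at the hypothetical path /dev/nvme1n1 (a large regular file works too), and permission to read it.

```python
import os
import time

# Hypothetical path: any connected NVMe-oF namespace, local NVMe drive, or large file.
PATH = "/dev/nvme1n1"
CHUNK = 4 * 1024 * 1024            # read in 4 MiB chunks
TARGET = 2 * 1024 * 1024 * 1024    # stop after 2 GiB

fd = os.open(PATH, os.O_RDONLY)
nread = 0
start = time.perf_counter()
while nread < TARGET:
    buf = os.read(fd, CHUNK)
    if not buf:                    # reached end of device or file
        break
    nread += len(buf)
elapsed = time.perf_counter() - start
os.close(fd)

print(f"read {nread / 2**20:.0f} MiB in {elapsed:.2f} s "
      f"({nread / 2**20 / elapsed:.0f} MiB/s)")
```

Because this reads through the page cache, it only gives a quick sanity check; for rigorous device-level numbers, a purpose-built tool such as fio with direct I/O is more representative.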

Achieving top GPU storage performance involves fine-tuning infrastructures in line with business goals. Here are four approaches to consider:

1) Efficiently expanding GPU storage capacity. InstaDeep, for example, offers an AI-as-a-Service solution for organizations that may not have the need or the means to run their own AI stack. As a result, InstaDeep requires maximum ROI and scalability from its infrastructure. In particular, the demands of multi-tenancy mean the infrastructure must be ever-ready to meet performance requirements for a wide range of workloads and clients.

Early on, when deploying its first GPU server system, the InstaDeep infrastructure team learned that the GPU server's local storage capacity would be too limited: it offered only 4TB of local storage, while customers' workloads require tens to hundreds of terabytes. The team investigated external storage options and found that traditional arrays would deliver far more capacity, but that their performance would ultimately hinder AI workloads, since applications needed to move data to and from the GPU systems, interrupting the workflow and reducing system efficiency.

By using software-defined storage to pool NVMe flash across a fast RDMA network, an approach that loads datasets up to 10x faster, InstaDeep achieved far greater GPU capacity utilization, eliminating the GPU storage bottleneck and improving ROI because the existing GPUs were more fully utilized.

2) Tuning for performance at scale. The Science and Technology Facilities Council (STFC) typifies how the rapid growth of AI deployments and of ML training datasets can tax computing infrastructure. Although it had added high-end GPU servers for greater computational capacity, STFC lacked the enterprise-level storage functionality required to scale the resource out across hundreds of researchers.

By implementing the NVMe-oF protocol over a high-speed network with RDMA capabilities, such as InfiniBand or RDMA over Converged Ethernet (RoCE) v2, large AI/ML user groups such as STFC can virtualize unused pools of NVMe SSDs spread across various servers so that they perform as if they were local. By doing this, STFC completed machine learning training tasks in an hour that formerly took three to four days. GPU storage no longer presents a bottleneck, even with its complex model training tasks.
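A quick way to see that "performs as if local" property is that fabric-attached namespaces register with the same Linux NVMe device model as local drives. The small sketch below, which assumes a Linux host with the nvme driver loaded, lists each NVMe controller and the transport reported in sysfs: pcie for local drives, rdma or tcp for NVMe-oF connections.

```python
from pathlib import Path

# List NVMe controllers known to the kernel and how each one is attached.
# "pcie" indicates a local drive; "rdma" or "tcp" indicates an NVMe-oF connection.
base = Path("/sys/class/nvme")
if not base.exists():
    print("no NVMe controllers found (is the nvme driver loaded?)")
else:
    for ctrl in sorted(base.glob("nvme*")):
        transport = (ctrl / "transport").read_text().strip()
        model_path = ctrl / "model"
        model = model_path.read_text().strip() if model_path.exists() else "unknown"
        print(f"{ctrl.name}: transport={transport:<5} model={model}")
```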

3) Using pooled NVMe storage under parallel file systems. When AI and ML applications involve accessing large numbers of small files from many GPU servers, deploying a parallel distributed file system as the storage infrastructure becomes a necessity. A parallel file system also makes it easier for storage to deliver the high throughput and low latency that most AI/ML uses require. Putting fast, elastic, pooled NVMe storage beneath the parallel file system improves metadata handling, which enables much higher read performance and better latency, and therefore higher GPU server utilization.

For example, a hyperscale technology provider recently debuted an AI solution for creating vehicle collision estimates used by insurance companies. To develop the AI logic behind the application, the workflow involved training models by ingesting datasets of up to 20 million small files, each between 150 and 700 KB. Data ingest typically took place at a pace of 1 million files every 8 hours, or roughly 35 files per client per second.
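The access pattern behind those numbers, many concurrent readers pulling millions of small files, is exactly what stresses a file system's metadata path. The sketch below is a simple, workload-shaped probe of that pattern: it reads every file under a hypothetical dataset directory with a pool of worker threads and reports files per second and aggregate throughput. The directory path, worker count, and file sizes are placeholders, not details of the deployment described above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

DATASET_DIR = Path("/mnt/parallel_fs/collision_photos")   # hypothetical mount point
NUM_WORKERS = 32                                           # concurrent readers

def read_file(path: Path) -> int:
    """Read one small file end to end and return its size in bytes."""
    return len(path.read_bytes())

files = [p for p in DATASET_DIR.rglob("*") if p.is_file()]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    total_bytes = sum(pool.map(read_file, files))
elapsed = time.perf_counter() - start

print(f"{len(files)} files, {total_bytes / 2**20:.0f} MiB in {elapsed:.1f} s -> "
      f"{len(files) / elapsed:.0f} files/s, {total_bytes / 2**20 / elapsed:.0f} MiB/s")
```

Running a probe like this before and after a storage change is a quick way to confirm whether small-file rate, rather than raw bandwidth, is the limiting factor.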

By using a pooled NVMe storage approach under a parallel distributed file system, the technology provider eliminated the storage bottlenecks it had been experiencing and achieved a 3-4x improvement in storage performance.

4) Examining GPU-specific "open highways." New data center architectures are tackling performance across servers, networking and storage in a unified way. One such approach, which debuted in fall 2019, integrates infrastructure elements from multiple vendors with GPU-optimized networking and storage to open a direct data path between GPU memory and storage, bypassing the CPU altogether. This lets data travel on the "open highways" offered by GPUs, storage and networking devices, achieving frictionless access to NVMe's superior performance at enterprise scale.
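For readers who want to experiment with this style of direct GPU-to-storage data path, one accessible entry point is NVIDIA's GPUDirect Storage interface as exposed by the open-source KvikIO Python bindings (not necessarily the specific multi-vendor solution referenced above). The sketch below is a minimal, illustrative read into GPU memory; it assumes KvikIO and CuPy are installed, the file path is hypothetical, and KvikIO is designed to fall back to ordinary POSIX I/O when a true GPU-direct path is unavailable.

```python
import cupy
import kvikio

PATH = "/mnt/nvme_pool/training_shard.bin"   # hypothetical file on fast NVMe storage

# Destination buffer allocated directly in GPU memory.
gpu_buffer = cupy.empty(64 * 1024 * 1024, dtype=cupy.uint8)   # 64 MiB

# Read from storage straight into the GPU buffer; with GPUDirect Storage available,
# the transfer avoids a CPU bounce buffer in host memory.
f = kvikio.CuFile(PATH, "r")
nbytes = f.read(gpu_buffer)
f.close()

print(f"read {nbytes / 2**20:.0f} MiB into GPU memory")
```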

The brisk pace of innovation in AI and ML means deployments today rely on technology that didn't exist a year ago and will likely be superseded next year. IT teams that become adept at fine-tuning GPU storage performance now, aware of the many new options before them, can achieve the system utilization and ROI that give their organizations a competitive edge.

Kirill Shoikhet is chief architect at Excelero, which provides software-defined block storage solutions for shared NVMe at local performance.

 
