
New Storage Architecture for GPU Computing at Hyperscale

By Dan Duperron, Senior Technical Marketing Engineer 

I’ll never forget ELIZA. I was in fifth grade and infatuated. I spent weeks typing in her source code from a book and then debugging it, trying (and failing) to bring her to life. Though I never got to have a conversation with ELIZA, she was my first experience with anything resembling AI.

These days, of course, we don’t hard-code AI algorithms; we “train” them — or they train themselves — using multiple petabytes of input data. This means that as enterprises deploy AI and similar compute-intensive workloads, they need access to massive amounts of data and to storage infrastructure with the scale and speed once found only in university or government labs. Unfortunately, the storage architectures we all grew up with can’t serve as data pipelines for high-performance computing in the enterprise.

The solution to this dilemma lies in an architecture known as hyperscale NAS. In this blog I’ll explore the need for hyperscale NAS, describe its characteristics and origin, and show an example of what it’s capable of.

Traditional Shared Storage Architectures Don’t Fit

Even with clusters of thousands of the latest GPUs, a single training run for a large AI model can take weeks or months to complete. Every clock cycle is precious, so providing a fast data pipeline to maximize GPU utilization is vital. At very large scale — hyperscale — this is a significant challenge.

There are two common architectures for shared storage today: scale-out NAS and high-performance (aka HPC) parallel file systems. Both have limitations that make them inappropriate for these new workloads in enterprise environments.

Figure 1 – Traditional Storage Architectures

Scale-out NAS relies on a matrix of nodes for data storage and network connectivity. These systems are easy to use, share files using standard protocols, and have well-developed data services with RAS (reliability, availability, and serviceability) features that enterprises rely on. Nodes in legacy scale-out NAS architectures communicate with each other over a fast back-end network.

Unfortunately, scale-out NAS suffers from diminishing incremental performance gains as node counts increase, in part because the volume of inter-node communication grows faster than the node count. Faster, lower-latency networking such as InfiniBand or RDMA over Converged Ethernet (RoCE) helps, but the performance of these systems rarely scales beyond a few dozen nodes.
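
To see why, consider a toy model (purely illustrative, with made-up numbers rather than measurements of any real product): if every node must exchange cluster traffic with every other node, the per-node overhead grows with cluster size, so aggregate client-facing throughput flattens and eventually collapses even as hardware keeps being added.

```python
# Toy model of scale-out NAS scaling (illustrative only, not a benchmark).
# Assumption: each node exchanges coherency/cluster traffic with every other
# node, and that overhead eats into the bandwidth it can serve to clients.

def effective_throughput(nodes: int, per_node_gbps: float = 10.0,
                         overhead_per_peer_gbps: float = 0.05) -> float:
    """Aggregate client-facing throughput after subtracting inter-node overhead."""
    overhead = overhead_per_peer_gbps * (nodes - 1)   # traffic to every peer
    usable_per_node = max(per_node_gbps - overhead, 0.0)
    return nodes * usable_per_node

for n in (4, 16, 64, 128, 256):
    print(f"{n:4d} nodes -> {effective_throughput(n):8.1f} Gb/s aggregate")
```

With these invented numbers, 64 nodes deliver far less than 16x the throughput of 4 nodes, and the curve eventually turns downward entirely, which is the diminishing-returns behavior described above.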

HPC parallel file systems separate metadata and data, empowering clients to talk directly to storage. This eliminates the need for an inter-cluster network and boosts performance. But these systems use special software that must be installed on every client machine, making them proprietary.

Put yourself in the shoes of a CIO shopping for storage to support a data-intensive GPU computing workload. How would you define the ideal shared storage architecture? I think the wish list would look something like this:

Shared Storage Wish List for GPU Computing Project
# | Requirement | Scale-Out NAS? | HPC Parallel FS? | Hyperscale NAS?
1 | Standards-based, non-proprietary | Yes | No | Yes
2 | Doesn’t require additional exotic networking (IB, NVMe-oF, etc.) | Maybe | Maybe | Yes
3 | Works with any storage, including what we have now | No | Yes | Yes
4 | Supports the RAS data services we need | Yes | No | Yes
5 | High performance that scales linearly | No | Yes | Yes

Table 1 – Shared Storage Requirements for Enterprise GPU Computing

In other words, CIOs want the features and familiarity of scale-out NAS combined with the scalability and performance of an HPC parallel file system. It turns out that’s the definition of hyperscale NAS.

pNFS and The Origin of Hyperscale NAS

Like most innovations, hyperscale NAS didn’t appear out of nowhere. It’s the result of the collaborative efforts of many people at a variety of organizations, public and private, over many years. The short version of the story is that first there was NFS, then parallel NFS (pNFS), and then the Flexible File (aka Flex Files) layout for pNFS, which enables the hyperscale NAS data plane.

Figure 2 – NFS Milestones

NFSv4 moved the protocol from stateless to stateful. That work was refined and expanded in NFSv4.1, which is more efficient, performant, and feature-rich than its predecessors. Finally, NFSv4.2 defined a set of optional features that extend the NFSv4.1 functionality.

Parallel NFS is an optional feature of NFSv4.1 and v4.2. It uses the familiar tactic of separating metadata from data to enable clients to talk directly to storage systems.

The pNFS specification defines three roles and three protocols. Clients use the metadata protocol (pNFS) to talk to metadata servers, and one of several storage-access protocols to talk to storage devices (really storage servers). Metadata servers use a control protocol to talk to storage devices.

pNFS didn’t define the control protocol, considering it out of scope. Unfortunately, this guaranteed that the server side of any pNFS implementation would be proprietary.

Luckily, pNFS also included the concept of “layouts.” Layouts enable different types of storage (block, file, object, etc.) to be used with pNFS. When the metadata server hands a layout to a client, it grants both the right to access a defined chunk of storage and the paths and protocols needed to reach that storage.
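
To make those roles concrete, here is a deliberately simplified sketch (plain Python with invented names, not the actual wire protocol): the client asks the metadata server for a layout covering a byte range of a file, receives the addresses of the storage servers that hold it, and then performs I/O directly against those servers.

```python
# Highly simplified sketch of the pNFS layout flow. Illustrative only:
# class and field names are invented and do not match the on-the-wire protocol.
from dataclasses import dataclass

@dataclass
class Layout:
    path: str           # file the layout covers
    offset: int         # start of the byte range granted to the client
    length: int         # length of that byte range
    data_servers: list  # addresses of the storage servers holding the data

class MetadataServer:
    """Owns the namespace; hands out layouts but never touches file data."""
    def layout_get(self, path: str, offset: int, length: int) -> Layout:
        # A real metadata server would consult the file's placement metadata here.
        return Layout(path, offset, length,
                      data_servers=["ds1.example.com", "ds2.example.com"])

class Client:
    def __init__(self, mds: MetadataServer):
        self.mds = mds

    def read(self, path: str, offset: int, length: int) -> None:
        layout = self.mds.layout_get(path, offset, length)  # metadata protocol
        for ds in layout.data_servers:
            # Storage-access protocol: I/O goes straight to the data servers,
            # bypassing the metadata server entirely.
            print(f"reading {path} [{offset}:{offset + length}] directly from {ds}")

Client(MetadataServer()).read("/training/shard-0001.bin", 0, 1 << 20)
```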

Figure 3 – Original pNFS versus pNFS with Flex Files

In 2018, a new layout type was defined, the pNFS Flexible File layout. The key difference compared to existing layout types is that with the Flex Files layout, the control protocol is now defined. Critically, it is defined using standard NFSv3 operations.

This change, defining a standards-based option for the previously undefined pNFS control protocol, eliminated the implicit requirement for the metadata server and storage servers to be purchased as part of a proprietary package, or even purchased at all! Both the client and server software for pNFS are built into Linux. There are no hardware dependencies — any NFSv3 storage server, whether commercial NAS, white box, or fully custom, may participate in a pNFS Flex Files cluster.
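
If you are curious whether a Linux client is actually exercising pNFS on a given NFSv4.2 mount, one rough way to check (a sketch only, assuming a standard Linux NFS client; the exact format of /proc/self/mountstats can vary between kernel versions) is to look for non-zero LAYOUTGET counts in the mount’s per-operation statistics:

```python
# Rough sketch: check whether a Linux NFS client has requested pNFS layouts,
# by scanning /proc/self/mountstats for LAYOUTGET operations on NFS mounts.
# Assumes a standard Linux NFS client; the file format can vary by kernel.

def pnfs_layoutgets():
    counts, current_mount = {}, None
    with open("/proc/self/mountstats") as f:
        for line in f:
            line = line.strip()
            if line.startswith("device ") and " with fstype nfs" in line:
                # e.g. "device server:/export mounted on /mnt/data with fstype nfs4 ..."
                current_mount = line.split(" mounted on ")[1].split(" with fstype ")[0]
                counts[current_mount] = 0
            elif current_mount and line.startswith("LAYOUTGET:"):
                # The first numeric field is the number of LAYOUTGET calls issued.
                counts[current_mount] = int(line.split()[1])
    return counts

for mount, count in pnfs_layoutgets().items():
    status = "using pNFS layouts" if count else "no layouts requested"
    print(f"{mount}: LAYOUTGET calls = {count} ({status})")
```

A count of zero usually means the server never granted a layout and I/O is falling back to ordinary NFS through the metadata path.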

Metadata and Data Services

Separating the metadata from the data path helps performance and provides an opportunity for centralization of services. Because the metadata servers represent the nerve center of a hyperscale NAS cluster, this metadata control plane is where data services are implemented, including snapshots, clones, versions, replication, monitoring, security, etc. This ability to automate cross-platform control makes managing even large and heterogeneous clusters easy.

Data services are elevated from individual storage systems into the global metadata control plane, where they are centrally and uniformly managed. The performance and capability of the metadata servers is therefore crucial. With the data path to the underlying storage commoditized, the control plane is where the innovation and differentiation between hyperscale NAS providers will become evident.

High Performance and Scalability

Every architecture works great on the whiteboard, right? It’s one thing to theorize about how a system will perform, and another to demonstrate it. Here at Hammerspace, we were confident in our implementation of the hyperscale NAS architecture, but it was only recently that we had the opportunity to prove it at scale.

The illustration below tells the story: 1,000 commodity storage servers containing 42 PB of NVMe storage hold the training data, which is served to 4,000 Linux nodes hosting 32,000 GPUs. Without installing proprietary client software on the Linux nodes, and without migrating data to a new repository, a single hyperscale NAS cluster saturates the available network and storage bandwidth to achieve 12.5 TB/s of throughput. No RDMA was used, just standard Ethernet.
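
For a sense of scale, a quick back-of-the-envelope calculation (my own arithmetic, assuming the load is spread evenly) shows what that aggregate number means per device: each storage server contributes roughly 12.5 GB/s, about one 100 Gb Ethernet link’s worth, while each client node receives a little over 3 GB/s.

```python
# Back-of-the-envelope math for the deployment described above
# (assumes throughput is distributed evenly; real traffic is never this uniform).
total_throughput_tb_s = 12.5   # aggregate throughput in TB/s
storage_servers = 1_000
client_nodes = 4_000
gpus = 32_000

per_server_gb_s = total_throughput_tb_s * 1_000 / storage_servers
per_client_gb_s = total_throughput_tb_s * 1_000 / client_nodes
per_gpu_gb_s = total_throughput_tb_s * 1_000 / gpus

print(f"Per storage server: {per_server_gb_s:.1f} GB/s (~{per_server_gb_s * 8:.0f} Gb/s)")
print(f"Per client node:    {per_client_gb_s:.2f} GB/s")
print(f"Per GPU:            {per_gpu_gb_s:.3f} GB/s")
```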

Figure 4 – Hyperscale NAS Implementation

Summing Up

New CPU/GPU- and storage-intensive workloads such as AI model training have exposed the limitations of traditional shared storage architectures. A new architecture is needed: one that is standards-based, software-defined, high-performance, hyper-scalable, and “civilized” enough for the corporate datacenter. Hyperscale NAS is that architecture.

Resources

There are lots of places to learn more and even get involved in shaping the future direction of the technology I’ve mentioned here. I urge you to check out the resources below.

The Internet Engineering Task Force continually works to advance Internet standards, including storage protocols like NFS. In addition to reading the RFCs linked throughout this document, you can learn more about standards in development on their website.

The Networking Storage Forum of the Storage Network Industry Association is a great place to learn about the latest networked storage (and storage networking) technology.

The InfiniBand Trade Association publishes lots of great information on IB, of course, but also on RDMA technology in general, including RoCE (RDMA over Converged Ethernet).

NVIDIA’s GPU-Accelerated Libraries Forum is where you can find a community of people working with GPUDirect Storage.

Finally, Hammerspace has recently started a Parallel NFS Resource Center to help spread the word about the capabilities of pNFS.

About the Author

Dan Duperron is a Senior Technical Marketing Engineer at Hammerspace. After wasting his electrical engineering degree working in corporate IT, he fell down the data storage rabbit hole and has never been happier. He particularly enjoys getting other people excited about new and clever storage technology.