Historically, parallel file systems have required additional client software to be loaded on each client machine that needs to work with high-performance data sets. Added software can be difficult to get approved under security teams' standards, is slow to be added to workstation images, and is typically licensed per client instance. These challenges have made it difficult to give access to all the users and applications that could derive value from the data sets housed in a parallel file system.
All of this began to change with the Linux community's vision of developing an embedded parallel file system as part of the NFS protocol. With the creation of pNFS (parallel NFS), standard Linux compute clients can now read and write directly to the storage, scaling performance linearly as clients are added. Expensive, proprietary software is no longer needed to create a parallel file system; pNFS is built into open standards.
For those of you who have been in the industry for a while, you know that the development of pNFS has been underway for quite some time and the vision has existed even longer.
I recently invited Trond Myklebust, Linux NFS Kernel Maintainer and CTO of Hammerspace, to join the Data Unchained podcast to discuss the role of the Linux NFS client in making high-performance access to decentralized data a reality for even the most conservative and secure enterprise architectures. We discussed the original vision as well as the evolution that has made today's use of pNFS a reality! A recap of the episode is below, and you can listen to the full episode to learn more.
Linux
Linux is distributed under an open source license, providing freedom of use and freedom of choice. It has become one of the most popular operating systems in the world throughout the enterprise and for developers.
The origin of the NFS client
The NFS client resides under the application interface, making data access transparent to the user and application. NFS users typically perform the same operations that they would on a local file system, such as opening, reading, and writing files. Underneath this activity, the NFS client converts these commands into remote procedure calls (RPCs) to the server, which executes the required operations.
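To make this concrete, here is a minimal sketch in Python, assuming /mnt/nfs is a hypothetical NFS mount point on the client. Notice that nothing in the application code is NFS-specific; the kernel's NFS client translates each call into RPCs behind the scenes.

```python
# Ordinary file I/O on a path that happens to live on an NFS mount.
# /mnt/nfs is a hypothetical mount point; this code is identical to
# what you would write for a local disk.

NFS_PATH = "/mnt/nfs/results.txt"

# open() and write() are translated by the kernel NFS client into
# OPEN and WRITE remote procedure calls to the server.
with open(NFS_PATH, "w") as f:
    f.write("hello from an unmodified application\n")

# Reading the file back issues READ RPCs; the application never
# sees or handles any of that.
with open(NFS_PATH) as f:
    print(f.read())
```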
The vision and evolution of pNFS
Most high-performance computing (HPC) applications require a parallel file system to ingest and process massive data sets, one that loads data onto clients by communicating with them directly. Historically, specialized, often proprietary software from technologies such as IBM Spectrum Scale (previously IBM GPFS), Lustre, Panasas PanFS, and Quantum StorNext has been required to deliver the performance needed for HPC, as well as for media and entertainment post-production environments. These workflows historically have not had open solutions available to them; only specialized workflows and a limited set of users and applications could work with the data sets in the parallel file system.
The pNFS protocol has been under development for the last 20 years to solve this problem. NFSv4.1 introduced the capability to provide a parallel file system client as part of a standard Linux distribution, adapting NFS for modern use cases involving large-scale operations in high-performance computing data centers and cloud platforms.
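Because the client ships in the standard kernel, putting it to work is just a mount. Below is a minimal sketch, assuming a hypothetical server nfs.example.com exporting /export and root privileges on the client; no additional client software is installed.

```python
import subprocess

# Mount an NFSv4.2 export using only what a standard Linux
# distribution provides: the in-kernel NFS client and the stock
# mount utility. Server name and export path are hypothetical.
subprocess.run(
    ["mount", "-t", "nfs4", "-o", "vers=4.2",
     "nfs.example.com:/export", "/mnt/data"],
    check=True,  # raise if the mount fails
)
```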
Metadata operations outside the data path
pNFS clearly defines the abstraction between metadata and data. Metadata operations go to a metadata server, which performs all the opens and closes on the data. When a client wants to talk to a data server, it requests a layout that maps the data's location and its means of access. The block and SCSI layout types required developers to add a number of features that allow the metadata server to recall a layout.
During a layout recall, pNFS temporarily halts I/O so that management operations can be performed on the data; the client then requests another layout before resuming I/O. This capability allows the pNFS model to manage data on the fly, which is highly useful under many circumstances. For example, the Hammerspace Global Data Environment leverages it to perform storage tiering, metadata replication, and high-performance movement of data across different sites.
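The flow described above can be sketched in a few lines of illustrative Python. The class and method names here are hypothetical, but LAYOUTGET, LAYOUTRETURN, and CB_LAYOUTRECALL are the actual NFSv4.1 operations being modeled: the client fetches a layout, performs I/O directly against the data, and, when the metadata server recalls the layout in order to move the data, pauses, returns the layout, and fetches a fresh one.

```python
# Illustrative model of the pNFS layout life cycle. All names are
# hypothetical; only the LAYOUTGET / LAYOUTRETURN / CB_LAYOUTRECALL
# operations referenced in the comments come from the NFSv4.1 spec.

class Layout:
    """Maps a file to the location of its data and how to reach it."""
    def __init__(self, location):
        self.location = location

class MetadataServer:
    """Hands out and recalls layouts; stays out of the data path."""
    def __init__(self):
        self.location = "tier-1"

    def layoutget(self, path):
        # LAYOUTGET: tell the client where the data for `path` lives
        return Layout(self.location)

    def layoutreturn(self, layout):
        # LAYOUTRETURN: with the layout back in hand, the server can
        # safely move the data (tiering, replication, migration)
        self.location = "tier-2"

class Client:
    def __init__(self, mds, path):
        self.mds, self.path = mds, path
        self.layout = mds.layoutget(path)

    def do_io(self):
        # With a valid layout, I/O goes straight to the data servers
        print(f"I/O direct to data at {self.layout.location}")

    def on_layout_recall(self):
        # CB_LAYOUTRECALL: pause I/O, return the layout, then request
        # a fresh layout mapping the data's new location and resume
        self.mds.layoutreturn(self.layout)
        self.layout = self.mds.layoutget(self.path)

mds = MetadataServer()
client = Client(mds, "/projects/frame-0001.exr")
client.do_io()             # reads/writes against the original tier
client.on_layout_recall()  # server moves the data mid-stream
client.do_io()             # I/O resumes against the new location
```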
pNFS in file data storage and data management solutions
Hammerspace contributed the Flexible Files layout technology, which makes it possible to provide non-disruptive access to data for applications and users while performing live data movement across storage tiers and geographic locations. Flexible Files, put to work in the Global Data Environment, can non-disruptively recall layouts, which enables data access and data integrity to be maintained even as files are being moved or copied. This has enormous ramifications for enterprises, as it can eliminate the downtime associated with data migrations and technology upgrades. Enterprises can combine this capability with software, such as a metadata engine, that can virtualize data across heterogeneous storage types and automate the movement and placement of data according to IT-defined business objectives.
Hammerspace Global Data Environment delivers an enterprise-ready, standards-based parallel file system
Hammerspace has contributed heavily both to the Linux community and to its own development to make high-performance access to data stored in multiple storage tiers and global locations a reality.
In the Hammerspace Global Data Environment:
- Enterprises with high-availability requirements and stringent security policies benefit from parallel file system performance with no additional software installation on client machines
- Parallel file system performance is possible because metadata and data service operations are offloaded from the data path to the Hammerspace Global Data Environment servers
- Secure sharing of data across Linux and Windows platforms in the Global Data Environment is possible because NFSv4.2 access control lists (ACLs) are compatible with Windows ACLs
About Trond Myklebust
Trond was originally educated as a particle physicist. He was working on his doctoral thesis but abandoned it when he discovered that he preferred writing code. While at the University of Oslo, he began to engage with the Linux kernel developers to fix bugs he was identifying. In these communications, Alan Cox, who was at the time Linux's number-two developer, suggested Trond as maintainer of the Linux NFS kernel code. Trond subsequently moved out of academia to his first industry position at NetApp, then moved on to Primary Data, and now holds a dual role as CTO of Hammerspace while continuing as the Linux NFS kernel maintainer.
Learn More
Listen to my full discussion with Trond on Episode 24 of the Data Unchained podcast.
About the Data Unchained Podcast
The paradigm for data access has changed in today’s decentralized world. Getting high-performance data to distributed applications, multiple data centers or cloud regions, and to remote workers who need to collaborate is a real challenge. The Data Unchained Podcast digs into these challenges and the solutions available that help make data a globally available resource. Contact us to set up a meeting and learn more about how Hammerspace can help you drive business value from your data and accomplish your digital transformation goals.