Hammerspace provides storageless data. That statement itself will raise many questions about the specific use cases that Hammerspace can solve. The use cases are merely the shape and form of the fundamental Hammerspace solution, namely separating metadata from data so that they can be orchestrated independently of one another. The goal of this blog is to take the reader quickly through the evolution of the abstraction we call a “file”, with the hope that it makes all the use cases Hammerspace addresses readily apparent. The reader is expected to be familiar with basic storage concepts.
The Evolution of File Abstraction
A file is a logical unit of data (a string of bits) that makes sense to an application. An image or video file makes sense to a rendering application, a database file to a database application, and so on. When an application creates a file that must be persisted, it stores the file on a file system. It follows that the file system must be able to retrieve the data (the string of bits) that the file represents when the application wants it, and must allow subsequent modification of that data as well.
The operations that a file system must support are therefore as follows. Note that I name the operations colloquially, and they don’t necessarily map to any specific file system’s operations:
- FIND – given a file name/path, check if the file exists and get its attributes and its (file)handle
- READ – given the file name/path/handle, read the data or metadata of the file
- WRITE – given the file name/path/handle, create/modify the data or metadata of the file
- DELETE – given the name/path/handle, delete the file
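To make the four operations concrete, here is a minimal in-memory sketch in Python. The names ToyFileSystem and FileEntry are hypothetical, the path doubles as the handle, and nothing here corresponds to a real file system's API; it only shows how FIND, READ, WRITE and DELETE relate to data and metadata.

```python
from dataclasses import dataclass, field
import time


@dataclass
class FileEntry:
    """Hypothetical record: the data bytes plus the attributes (metadata)."""
    data: bytearray = field(default_factory=bytearray)
    attrs: dict = field(default_factory=dict)


class ToyFileSystem:
    """Illustrative only; the path itself doubles as the file handle."""

    def __init__(self):
        self._files = {}

    def find(self, path):
        """FIND: return (handle, attributes) if the file exists, else None."""
        entry = self._files.get(path)
        if entry is None:
            return None
        return path, dict(entry.attrs)

    def write(self, path, data, offset=0):
        """WRITE: create the file if needed, then update data and metadata."""
        entry = self._files.setdefault(path, FileEntry())
        if offset > len(entry.data):
            # Writing past the end: pad the gap with zero bytes.
            entry.data.extend(b"\0" * (offset - len(entry.data)))
        entry.data[offset:offset + len(data)] = data
        entry.attrs["size"] = len(entry.data)
        entry.attrs["mtime"] = time.time()

    def read(self, path, offset=0, length=None):
        """READ: return the requested byte range of the file's data."""
        entry = self._files[path]
        end = len(entry.data) if length is None else offset + length
        return bytes(entry.data[offset:end])

    def delete(self, path):
        """DELETE: remove the file's data and metadata together."""
        del self._files[path]
```

Note that every write touches both the data and the attributes. How tightly those two updates are bound together is the property that changes across the file system types described next.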
The evolution presented here is not necessarily chronological, but represents a movement in the following dimensions:
- From a strongly coupled to a loosely coupled parallel system
- From a finite storage capacity system to an infinite system
- From a proprietary system to an open system
- From a storage-centric to a storageless system
Serial Access File Systems
File data and metadata are stored and accessed via common componentry, such as CPU, memory, network, and storage media. Changes to data as well as metadata are appended to the end of all existing data and metadata. This usually happens on tape drives where, to construct a file system view, the file system must serially scan entries on the tape, starting at the beginning, to arrive at the current consistent view.
Updates to metadata and data are very strongly coupled. They are not only persisted onto the same storage media but are also persisted interleaved due to the serial access constraint. Caching and other mechanisms provide mitigation, but the nature of serial access to tape prevents random and parallel access. The implementation is very storage-centric, as it is designed with tapes in mind. An example of this is the Linear Tape File System (LTFS).
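The serial scan can be illustrated with a simplified append-only log. This is a sketch of the general idea only, not the LTFS on-tape format; the record layout (operation, path, payload) and the function name replay are assumptions made for the example.

```python
def replay(tape):
    """Scan records from the start of the tape and build the current view.

    Each record is a hypothetical (operation, path, payload) tuple. Data and
    metadata changes are interleaved in one sequence, so the only way to
    learn the current state is to read the records in order.
    """
    view = {}
    for op, path, payload in tape:
        if op == "write":
            view[path] = payload   # a later record supersedes an earlier one
        elif op == "delete":
            view.pop(path, None)
    return view


tape = []                            # appending is the only way to change it
tape.append(("write", "/a", b"v1"))
tape.append(("write", "/b", b"x"))
tape.append(("write", "/a", b"v2"))  # an update is a new record at the end
tape.append(("delete", "/b", None))

current = replay(tape)
```

After the replay, current holds only /a with its latest contents, even though the superseded records still occupy space on the tape.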
Random-Access File Systems
The file metadata and data are both stored and accessed via common componentry, such as CPU, memory, network and storage media. This type of file system is widely used in personal computers as well as in enterprise-class storage systems. The storage is usually formatted up front, which means that it is logically divided into two parts: one to store metadata and the other to store data. This is illustrated by the diagram below.
With this layout, operations on metadata do not require reading any of the data, and updates to data can update the metadata in place. For example, as files grow in size, the metadata can track all the new blocks that hold their data and stitch them together to give the application complete random access. The ext3 file system is a good illustration of this.
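A minimal sketch of that block tracking follows. ToyVolume, BLOCK_SIZE and the inode dictionary are illustrative assumptions and do not reflect the real ext3 on-disk structures; the point is that the metadata region maps a file's byte offsets to blocks, so a read at any offset needs no scan.

```python
BLOCK_SIZE = 4  # deliberately tiny so the example spans several blocks


class ToyVolume:
    """Hypothetical formatted volume with separate metadata and data regions."""

    def __init__(self):
        self.blocks = []   # data region: fixed-size blocks
        self.inodes = {}   # metadata region: name -> size and block numbers

    def append(self, name, data):
        """Grow a file, allocating blocks as needed and updating metadata in place."""
        inode = self.inodes.setdefault(name, {"size": 0, "blocks": []})
        for byte in data:
            pos = inode["size"]
            if pos % BLOCK_SIZE == 0:
                # The last block is full (or the file is empty): allocate one.
                self.blocks.append(bytearray(BLOCK_SIZE))
                inode["blocks"].append(len(self.blocks) - 1)
            block_no = inode["blocks"][pos // BLOCK_SIZE]
            self.blocks[block_no][pos % BLOCK_SIZE] = byte
            inode["size"] += 1

    def read_at(self, name, offset, length):
        """Random access: translate each offset to a block via the metadata."""
        inode = self.inodes[name]
        end = min(offset + length, inode["size"])
        out = bytearray()
        for pos in range(offset, end):
            block_no = inode["blocks"][pos // BLOCK_SIZE]
            out.append(self.blocks[block_no][pos % BLOCK_SIZE])
        return bytes(out)

    def stat(self, name):
        """A metadata-only operation: no data block is touched."""
        return self.inodes[name]["size"]
```

If two files grow alternately, their blocks end up interleaved in the data region, yet each file still reads back as one contiguous byte range because the block list stitches it together.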
A client-server architecture can also be implemented on top of such a file system, where the metadata and data are hosted on a server system and accessed by a client system over a file access protocol like NFS or SMB, or an object-access protocol like S3 or Azure Blob.
Parallel File Systems
In this case, file metadata and data are accessed through separate componentry and within the same data center as the systems running the applications accessing the data. This type of file system is realized using only a client-server architecture where the application runs on the client system and the metadata and data are hosted on separate server systems.
They are called parallel because of the parallelism that clients achieve when accessing their data across the various data servers. Clients communicate with the metadata server for metadata operations and then communicate directly with the data storage servers to access their data. Because the two paths use separate componentry, the architecture provides the scale-out benefits expected of it. The data is completely virtualized in that the data for a given file can reside on any data storage server. An example of this is the Lustre file system.
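The split between the metadata path and the data path can be sketched as follows. The classes MetadataServer and DataServer, the round-robin placement, and the chunk key format are assumptions chosen for illustration; they are not Lustre's actual protocol or layout format.

```python
from concurrent.futures import ThreadPoolExecutor


class DataServer:
    """Hypothetical data server: stores chunks by key, knows nothing of files."""

    def __init__(self):
        self.chunks = {}

    def put(self, key, data):
        self.chunks[key] = data

    def get(self, key):
        return self.chunks[key]


class MetadataServer:
    """Hypothetical metadata server: records where each file's chunks live."""

    def __init__(self):
        self.layouts = {}  # path -> ordered list of (server index, chunk key)

    def set_layout(self, path, layout):
        self.layouts[path] = layout

    def get_layout(self, path):
        return list(self.layouts[path])


def write_striped(mds, servers, path, data, stripe_size):
    """Spread a file's data over the data servers and record the layout."""
    layout = []
    for n, start in enumerate(range(0, len(data), stripe_size)):
        idx = n % len(servers)  # round-robin placement, an arbitrary choice
        key = f"{path}#{n}"
        servers[idx].put(key, data[start:start + stripe_size])
        layout.append((idx, key))
    mds.set_layout(path, layout)


def read_striped(mds, servers, path):
    """One metadata request, then chunk reads issued to data servers in parallel."""
    layout = mds.get_layout(path)
    with ThreadPoolExecutor() as pool:
        # map() yields results in layout order, so the chunks join correctly.
        parts = pool.map(lambda loc: servers[loc[0]].get(loc[1]), layout)
        return b"".join(parts)
```

The metadata server never sees the file's bytes, and the data servers never see the file's name or attributes. Adding data servers therefore adds bandwidth without adding load to the metadata path.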
Distributed File Systems
The file metadata and data are accessed via separate componentry but can, in this case, span multiple data centers. It is implicit that clients, metadata servers and data servers can all run on separate systems as in a parallel file system. However, a major distinction is that the namespace is virtualized. This means that it is hosted and exported by multiple servers in multiple geographical regions, all of them synchronizing with each other to present a consistent and cohesive namespace to clients.
A fundamental architectural trade-off in such a system is whether these servers should be strongly coupled (think: distributed locking) or loosely coupled. The goal at Hammerspace has always been to evolve to a loosely coupled, parallel, infinite and open system. Thus, each metadata server (HA-pair strongly recommended) in each region can operate independently in the absence of the other regions, and synchronizes with them as and when it can. This means that we had to implement a conflict-resolution mechanism for the case where multiple regions, called sites, write to the same file at the same time. Once the namespace is hosted by multiple sites it stops being hosted by any one site in particular – it becomes Hammerspace! Consequently, on any given site all that a client needs to know to pull out its data (the hammer) is the IP address to mount (the extra-dimensional instantly accessible storage area). It should be pointed out that by open I mean that this all works with a standard out-of-the-box Red Hat Linux client.
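The text does not say how Hammerspace resolves such conflicts, so the following is only a sketch of one common policy, last-writer-wins with the site name as a tie-breaker. It is swapped in purely to show what loosely coupled sites that synchronize when they can might look like; Site, Version and sync are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """Hypothetical per-file version stamp carried with each write."""
    timestamp: float
    site: str
    content: bytes


class Site:
    """A site accepts writes locally, with no locking across sites."""

    def __init__(self, name):
        self.name = name
        self.files = {}

    def write(self, path, content, timestamp):
        self.files[path] = Version(timestamp, self.name, content)

    def merge_from(self, other):
        """Adopt the other site's version where it is newer.

        Assumed policy: the later timestamp wins, and the site name breaks
        ties so that both sites reach the same answer.
        """
        for path, theirs in other.files.items():
            mine = self.files.get(path)
            if mine is None or (theirs.timestamp, theirs.site) > (mine.timestamp, mine.site):
                self.files[path] = theirs


def sync(a, b):
    """Exchange state in both directions whenever the sites can reach each other."""
    a.merge_from(b)
    b.merge_from(a)
```

Because the comparison is deterministic, both sites converge on the same version of every file after a sync, regardless of which site merges first.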
Storageless Data Orchestration
When you combine the power of such a distributed, loosely coupled, parallel, infinite and open system with three more powerful features, you arrive at storageless data.
- Autonomic and live data portability
- A rich intent-based orchestration mechanism
- The ability of an end-user to easily accomplish the first by using the second
With live data portability, data can migrate freely between data servers within a data center, transparently, without the application realizing it. The data on the data storage server can be accessed either through a file access protocol like NFSv3 or through an object-access protocol (e.g., S3 or Azure Blob). The data can also move transparently across data centers, whether they are on-premises or in the public cloud, either on demand when a client accesses it or by declarative intent on the part of an authorized user or administrator. The intent-based orchestration can be predicated on a rich selection of metadata, including metadata harvested from the data itself using services available in the public cloud, such as an image recognition function (e.g., Amazon Rekognition).
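As a rough illustration of declarative intent, the sketch below matches files to placement targets purely by their metadata. The Objective class, the metadata keys (location, labels, days_idle) and the target names are all hypothetical and are not Hammerspace's actual objective syntax; the labels stand in for tags that an image recognition service might have produced.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Objective:
    """Hypothetical intent: data matching this predicate belongs at this target."""
    name: str
    matches: Callable[[dict], bool]
    target: str


def plan_moves(catalog, objectives):
    """Compare each file's metadata to the objectives and list the moves needed.

    Only metadata is consulted. No data is read to make the decision.
    """
    moves = []
    for path, meta in catalog.items():
        for objective in objectives:
            if objective.matches(meta):
                if meta["location"] != objective.target:
                    moves.append((path, meta["location"], objective.target))
                break  # assumed policy: the first matching objective wins
    return moves


catalog = {
    "/img/cat.jpg": {"location": "nas-1", "labels": {"cat"}, "days_idle": 3},
    "/logs/old.log": {"location": "nas-1", "labels": set(), "days_idle": 400},
    "/img/dog.jpg": {"location": "cloud-gpu", "labels": {"dog"}, "days_idle": 1},
}
objectives = [
    Objective("labelled images near compute", lambda m: bool(m["labels"]), "cloud-gpu"),
    Objective("archive cold data", lambda m: m["days_idle"] > 365, "object-archive"),
]
moves = plan_moves(catalog, objectives)
```

Here the user states where data should live in terms of what the data is, and the planner works out which files are out of place. The file already at its target produces no move.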
Storageless Data builds upon the previous evolution but takes an additional and huge step forward toward a revolution, by managing data directly through its associated metadata instead of through the storage infrastructure. By infusing data with declarative intent and making it portable, Hammerspace frees data from its prison, namely the underlying storage infrastructure. When data is free from those limitations you have Storageless Data!