
How to Eliminate Data Silos to Optimize AI Initiatives

The use of data analytics, BI applications, and data warehouses for structured data is a mature industry, and the strategies for extracting value from structured data are well known. But the explosive growth of generative AI now holds the promise of extracting hidden value from unstructured data as well.

AI use cases also promise to help enterprises achieve new outcomes by surfacing insights previously hidden in large volumes of unstructured file data. This data is typically stored in multiple locations in data centers and across multiple clouds. A solution is now available to put all data – structured, semi-structured, and unstructured – to work in AI, creating new business value as well as driving efficiencies.

The problems faced by AI workloads

The problem is that the barriers separating unstructured data silos have become a serious limitation on how quickly IT organizations can implement AI pipelines without costs, governance controls, and complexity spiraling out of control. One look at the technology trade press shows a feeding frenzy among storage vendors touting one-size-fits-all solutions to this problem. But organizations need to leverage the data they already have, and they simply cannot afford to throw out existing infrastructure and migrate all their unstructured data to new platforms in order to implement AI strategies.

Creating actionable structure out of unstructured data

AI use cases and technologies are evolving so rapidly that data owners need the freedom to pivot at any time to scale up or down, or to bridge multiple sites with their existing infrastructure, all without disrupting data access for existing users or applications.

Hammerspace provides an actionable structure for disparate silos of unstructured data wherever it is today, without the need for wholesale data migrations or replacement of existing storage infrastructure. By providing analytics, artificial intelligence, deep learning, and machine learning workflows with unified access to, and automated control of, all data on any storage type anywhere, Hammerspace helps data owners not only leverage their existing storage resources but also dramatically accelerate AI workloads on distributed unstructured datasets.

Design for flexibility in your AI journey

There are multiple phases in AI workflows, and of course many different AI use cases that can vary greatly depending on the industry or desired outcome. But as diverse as the AI use cases are, the common denominator among them all is the need to collect data from many diverse sources, and often different locations.

The fundamental problem is that access to data by both humans and AI models is always funneled through a file system at some point. The issue is that file systems have traditionally been embedded within the storage infrastructure. The result of this infrastructure-centric approach is that when data outgrows the storage platform it is on today, or if different performance requirements or cost profiles dictate the use of other storage types, users and applications must navigate across multiple access paths to incompatible systems to get to their data.

Solving the silo problem to unlock the AI puzzle

This problem is particularly acute for AI workloads, where a critical first step is to consolidate data from multiple sources to enable a global view across them all. AI workloads must have access to the complete dataset to classify and/or label the files as the first step to figuring out which of them should be refined down to the next step in the process.

With each phase of the AI journey, the data is further refined. This might include cleansing and large language model (LLM) training, or in some cases tuning of existing LLMs through iterative inferencing runs to get closer to the desired output. Each of these steps also has different compute and storage performance requirements, ranging from slower, less expensive mass storage systems and archives all the way to high-performance, more costly NVMe storage.

The fragmentation caused by the storage-centric lock-in of file systems at the infrastructure layer is not a new problem unique to AI use cases. For decades, IT professionals have faced the choice of overprovisioning their storage infrastructure to serve the subset of data that needed high performance, or paying the “data copy tax” and added complexity of shuffling file copies between different systems. This long-standing problem is now evident in the training of AI models as well as throughout the ETL process.

Decoupling the file system from the infrastructure layer

Unlike conventional storage platforms that embed the file system within the infrastructure layer, Hammerspace is a software-defined solution that is compatible with any on-premises or cloud-based storage platform from any vendor, creating a high-performance, cross-platform Parallel Global File System that spans otherwise incompatible storage silos across one or more locations, including the cloud.

With the file system decoupled from the underlying infrastructure, Hammerspace is able to automate data orchestration as a background operation while providing high performance to GPU clusters, AI models, and data engineers. With Hammerspace, all users and applications in all locations have read/write access to all data everywhere. Not to file copies, but to the same files via this unified, global metadata control plane.

Of critical importance to AI workflows, data classification can be significantly enhanced and automated within Hammerspace. The system includes powerful metadata management capabilities that enable files and directories to be manually or automatically tagged with user-defined custom metadata, creating a rich set of file classification information that can be used to streamline the classification phase of AI workflows and simplify later iterations.
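As a conceptual illustration only (this toy catalog and its tag names are invented for this sketch, not Hammerspace's actual API), the idea is that once files carry custom metadata tags, the classification phase of an AI workflow becomes a metadata query rather than a crawl of every silo:

```python
# Toy sketch: a metadata catalog where custom tags attached to files drive
# classification. Paths span different silos, but the tags live in one namespace.
# Tag keys ("modality", "stage") are hypothetical, invented for illustration.

catalog = {}  # path -> dict of custom metadata tags

def tag(path, **tags):
    """Attach or update custom metadata tags for a file."""
    catalog.setdefault(path, {}).update(tags)

def select(**criteria):
    """Return files whose tags match every criterion, regardless of
    which storage silo the file physically lives on."""
    return sorted(
        path for path, tags in catalog.items()
        if all(tags.get(k) == v for k, v in criteria.items())
    )

# Files from different silos are tagged in one unified namespace.
tag("/nas1/scans/img001.dcm", modality="ct", stage="raw")
tag("/s3/archive/img002.dcm", modality="ct", stage="cleansed")
tag("/nas2/notes/rpt003.txt", modality="text", stage="raw")

# Classification becomes a query instead of a per-silo crawl.
print(select(modality="ct"))
```

The point of the sketch is the shape of the operation: the query sees one logical namespace, so later pipeline iterations can reuse the tags instead of re-scanning each storage system.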

Empowering data engineers with self-service workflow automation

Since many industries, such as pharma, financial services, and biotechnology, require archiving of both the training data and the resulting models, the ability to automate placement of this data on low-cost resources is critical. With custom metadata tags tracking data provenance, iteration details, and other steps in the workflow, recalling old model data for reuse or to apply a new algorithm is a simple operation that can be automated in the background.
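To illustrate the idea (the tier names and tag keys below are invented for this sketch, not Hammerspace's actual policy syntax), a tag-driven placement rule might look like:

```python
# Hypothetical sketch of a declarative placement rule: it maps a file's custom
# metadata tags to a storage tier, so that archiving training data and trained
# models to low-cost storage can run as a background operation. The tier names
# and tag keys ("stage", "provenance") are invented for illustration.

def placement_tier(tags: dict) -> str:
    """Decide where a file should live based on its custom metadata."""
    if tags.get("stage") == "active-training":
        return "nvme"            # hot data, kept close to the GPU cluster
    if tags.get("stage") == "archived":
        return "object-archive"  # low-cost tier; provenance tags make recall easy
    return "capacity-nas"        # default bulk tier for everything else

# Recalling an archived model for reuse is then just a tag change: flipping
# "stage" back to "active-training" promotes it to the fast tier in the background.
print(placement_tier({"stage": "archived", "provenance": "run-0042"}))
```

The design point is that placement follows the metadata, so the workflow never has to name a specific storage system; the same rule applies wherever the file currently lives.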

In this way, Hammerspace can give data scientists direct, self-service control over all stages of the AI pipeline, across multiple locations, storage silos, and clouds, without needing to request data retrieval from IT administrators or get involved in infrastructure management. And because the data can be seamlessly accessed on existing storage resources, these workflows can leverage data in place without the need to replace legacy storage systems with new infrastructure.

In summary

The rapid shift to accommodate AI workloads has created a challenge that exacerbates the silo problems that IT organizations have faced for years. And the problems have been additive:

  • To remain competitive and manage the new AI workloads, data access needs to be seamless across local silos, locations, and clouds, while also supporting very high-performance workloads.
  • Companies also need to be agile in a dynamic environment where fixed infrastructure may be difficult to expand due to cost or logistics. The ability to automate data orchestration across different siloed resources, or to rapidly burst to cloud compute and storage resources, has become essential.
  • At the same time, enterprises need to bridge their existing infrastructure with these new distributed resources in a way that is cost-effective, ensuring that the cost of implementing AI workloads does not crush the expected return.

Hammerspace software is ideally suited to provide customers a solution to these problems, without the need to retool their data centers with one-size-fits-all storage that is overprovisioned.

To keep up with the many performance requirements of AI pipelines, a new paradigm was necessary, one that could effectively bridge the gaps between on-premises silos and the cloud. Such a solution required new technology and a revolutionary approach: lifting the file system out of the infrastructure layer so that AI pipelines can utilize existing infrastructure from any vendor without compromising results.

This is the Hammerspace innovation.

About the Author

Floyd is Vice President of Product Marketing for Hammerspace. He has been involved with data management and storage for more than 25 years, focused on the methods and technologies needed to manage extreme volumes of data and keep up with the needs of modern, distributed storage resources and workflows.