3 Reasons Why Object Storage Is Right for AI and Machine Learning
Enterprises of all types are ramping up AI and machine learning projects, but realizing their true potential requires overcoming significant technology barriers. While compute infrastructure is often the focus, storage is equally important. Here are three key reasons why object storage — rather than file or block storage — is uniquely suited for AI and ML workloads:
1. Scalability: AI and ML are most effective when they have a large and varied data source to learn from. Data scientists draw on that rich data to train domain models. Of the “five V’s of big data” (volume, variety, velocity, veracity and value), the first two, volume and variety, matter most for AI and ML. Simply put, AI and ML depend on huge volumes of highly varied data (images, text, structured and semi-structured) to build useful models, provide accurate results and ultimately deliver business value.
Object storage is the most scalable storage architecture, which makes it well suited to the massive quantities of data that AI and ML require. It is designed for limitless growth through a horizontal, scale-out approach: organizations expand a deployment by adding nodes whenever and wherever they are needed. Because object storage uses a single, global namespace, this scaling can also span multiple geographic sites at once. File and block systems, on the other hand, usually employ a scale-up approach. These platforms grow vertically by adding more resources to individual nodes, and each node can only grow so far. They cannot scale horizontally as effectively by deploying additional nodes.
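To make the scale-out idea concrete, here is a minimal sketch of one common placement technique, consistent hashing, in which object keys in a flat namespace are spread across nodes and adding a node relocates only a fraction of them. This is an illustration under assumptions: the article does not say which placement scheme any given product uses, and `HashRing`, the node names and the key names are all hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring using a stable hash."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring that maps object keys to storage nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual points per node, to even out the spread
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Scale out: place the new node's virtual points on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def node_for(self, key: str) -> str:
        """Return the node that owns the first ring point at or after the key."""
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical object keys in a single flat namespace.
keys = [f"images/xray-{i:05d}.png" for i in range(10_000)]

ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # add capacity by adding a node
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved, all of them to node-d")
```

In this toy, every key that changes owner lands on the new node and the rest stay put, which is what lets a cluster grow without reshuffling all of its data.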
2. APIs: Robust, flexible data APIs are essential for AI and ML, which, as noted, use a range of data types. Storage platforms need APIs that can accommodate that variety. In addition, while AI and ML innovation is increasingly done on public clouds, a sizable share still happens on-premises or in private clouds, depending on the particulars of the use case. For example, capacity-intensive workloads in areas such as scientific research and healthcare tend to be best suited to private clouds. This means organizations need a storage API that supports workloads in both public and on-premises/private clouds.
File and block storage platforms are limited in the APIs they support, in part because they are older architectures. Object storage, in contrast, uses a higher-level, application-centric API that was born in the cloud. That API covers a broader range of functions than file and block interfaces do, including versioning, lifecycle management, encryption, Object Lock and metadata. It can also be extended with new functions for AI and ML use cases, such as support for streaming data and querying of massive data sets.
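As a rough illustration of one of these functions, the sketch below models what versioning means for a training data set: every write keeps the earlier copy, so a model can be retrained against exactly the data it first saw. It is a toy in-memory model, not a real storage client. `VersionedBucket` and its counter-based version IDs are assumptions made for the example; real object stores assign their own version identifiers.

```python
import itertools
from collections import defaultdict
from typing import Optional


class VersionedBucket:
    """Toy in-memory bucket that keeps every version of each object."""

    def __init__(self):
        self._versions = defaultdict(list)   # key -> [(version_id, data), ...]
        self._counter = itertools.count(1)   # hypothetical ID scheme: v1, v2, ...

    def put(self, key: str, data: bytes) -> str:
        """Store a new version of the object and return its version ID."""
        version_id = f"v{next(self._counter)}"
        self._versions[key].append((version_id, data))
        return version_id

    def get(self, key: str, version_id: Optional[str] = None) -> bytes:
        """Return the latest version, or a specific one if version_id is given."""
        versions = self._versions.get(key)
        if not versions:
            raise KeyError(key)
        if version_id is None:
            return versions[-1][1]
        for vid, data in versions:
            if vid == version_id:
                return data
        raise KeyError(f"{key}@{version_id}")


bucket = VersionedBucket()
first = bucket.put("datasets/train.csv", b"label,text\n")
bucket.put("datasets/train.csv", b"label,text\n1,hello\n")

latest = bucket.get("datasets/train.csv")           # the newest upload
original = bucket.get("datasets/train.csv", first)  # the data as first stored
```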
With the standardization of object storage APIs around Amazon S3, it has become easier to integrate software on-prem and in public clouds. Organizations can easily expand AI and ML deployments from on-prem/private cloud environments to the public cloud, or move cloud-native AI and ML workloads to on-prem environments, without loss of functionality. This bimodal approach enables organizations to leverage on-prem/private cloud and public cloud resources cooperatively and interchangeably.
Because the S3 API has become the de facto standard for object storage, many software tools and libraries can take advantage of it. This allows code, software and tools to be shared, which promotes more rapid development in the AI/ML community. Examples include popular ML platforms, such as TensorFlow and Apache Spark, that have built-in support for the S3 API.
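In practice, this portability often comes down to which endpoint an S3 client is pointed at. The sketch below assumes the third-party boto3 library, whose `boto3.client("s3", endpoint_url=...)` call accepts a custom endpoint. The environment variable name `S3_ENDPOINT_URL` and the example URL are hypothetical choices for this illustration, not a convention taken from the article.

```python
import os


def s3_client_kwargs(env=os.environ):
    """Build keyword arguments for boto3.client("s3", ...).

    With no endpoint configured, the client uses the public cloud default.
    With S3_ENDPOINT_URL set (a hypothetical variable name), the same code
    talks to an S3-compatible system on-prem or in a private cloud.
    """
    kwargs = {}
    endpoint = env.get("S3_ENDPOINT_URL")
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs


def make_s3_client(env=os.environ):
    """Create an S3 client for whichever environment is configured."""
    import boto3  # third-party; imported here so the helper above needs nothing extra
    return boto3.client("s3", **s3_client_kwargs(env))


# Same application code, two targets (the URL is a placeholder):
public_cloud = s3_client_kwargs({})
on_prem = s3_client_kwargs({"S3_ENDPOINT_URL": "https://objects.example.internal"})
```

Keeping the endpoint in configuration rather than in code is one simple way to let the same training or ingest script run in either environment.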
3. Metadata: As with APIs, it’s critical that organizations using AI and ML can leverage unlimited, customizable metadata. Metadata is simply data about data: at the most basic level, when and where a piece of data was created and who created it. But metadata can describe much more, because users can create arbitrary metadata tags to describe any attribute they want.
Rich metadata is necessary for data scientists to locate specific data to build and use their AI and ML models. Metadata annotations allow progressive building of knowledge as more information is added to the data.
File and block storage support only limited metadata, such as the basic attributes described above. This largely goes back to scalability: rich metadata for AI and ML apps that rely on huge data sets causes a storage system to grow quickly, and file and block systems aren’t equipped for quick, seamless growth. Object storage, by contrast, supports unlimited, fully customizable metadata, making it easier to locate data for use in AI and ML algorithms and to draw better insights from it.
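A small sketch may help show what customizable metadata buys a data scientist: objects carry arbitrary key/value tags, tags can be added over time, and data is located by tag rather than by path. This is a toy in-memory model under assumptions. `MetadataStore`, the tag names and the object keys are hypothetical and do not represent any product's API.

```python
class MetadataStore:
    """Toy object store where each object carries arbitrary metadata tags."""

    def __init__(self):
        self._objects = {}   # key -> (data, metadata dict)

    def put(self, key, data, **metadata):
        """Store an object together with any user-defined tags."""
        self._objects[key] = (data, dict(metadata))

    def tag(self, key, **metadata):
        """Add or update tags later, as more is learned about the object."""
        self._objects[key][1].update(metadata)

    def find(self, **criteria):
        """Return the keys of objects whose tags match every criterion."""
        return sorted(
            key
            for key, (_, meta) in self._objects.items()
            if all(meta.get(name) == value for name, value in criteria.items())
        )


store = MetadataStore()
store.put("scans/0001.png", b"<image bytes>", modality="xray", body_part="wrist")
store.put("scans/0002.png", b"<image bytes>", modality="xray", body_part="knee")

# Knowledge accumulates: a later review adds a finding to the first scan.
store.tag("scans/0001.png", finding="fracture")

matches = store.find(modality="xray", finding="fracture")
```

Here `matches` contains only `scans/0001.png`, the one object that carries both tags.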
Consider the example of a hospital using an image recognition app on X-ray images. A TensorFlow model could analyze each image added to an object storage system and then assign granular metadata labels to it (e.g., the type of injury, or the age or gender of the patient based on bone size or growth). A TensorFlow model could then be trained on that metadata and analyze it to yield new patient insights (e.g., women in their 20s and 30s are suffering more bone ailments than was the case five years earlier).
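The second half of that example, turning labels into insight, can be sketched as a simple aggregation over metadata. The records below are made up purely for illustration and stand in for labels a model might have attached at ingest. The field names (`finding`, `year`, `gender`, `age`) and the helper functions are assumptions of this sketch, and a real analysis would involve far more than counting.

```python
from collections import Counter


def age_band(age: int) -> str:
    """Bucket an age into a decade label such as '20s'."""
    return f"{(age // 10) * 10}s"


def findings_by_group(records, finding):
    """Count objects with the given finding per (year, gender, age band)."""
    counts = Counter()
    for meta in records:
        if meta.get("finding") == finding:
            group = (meta["year"], meta["gender"], age_band(meta["age"]))
            counts[group] += 1
    return counts


# Made-up metadata records, one per stored X-ray image.
records = [
    {"finding": "fracture", "year": 2015, "gender": "F", "age": 27},
    {"finding": "fracture", "year": 2020, "gender": "F", "age": 24},
    {"finding": "fracture", "year": 2020, "gender": "F", "age": 29},
    {"finding": "none", "year": 2020, "gender": "M", "age": 41},
]

counts = findings_by_group(records, "fracture")
earlier = counts[(2015, "F", "20s")]
later = counts[(2020, "F", "20s")]
```

Comparing `earlier` with `later` across years is the kind of question that becomes easy to ask once every image carries consistent labels.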
With so much noise about AI and ML from almost every Fortune 500 company, it’s conceivable that these technologies will be the most important enterprise IT initiatives for the foreseeable future. However, for AI/ML initiatives to pay off, organizations must leverage the right storage infrastructure. Object storage is an optimal backbone for AI and ML due to its scalability, support for diverse APIs (especially S3), and rich metadata.
Gary Ogasawara is CTO of Cloudian, a data storage company headquartered in San Mateo, Calif.