Hugging Face (🤗) is a platform that allows developers to train and deploy open-source AI models. Like GitHub, it provides a space for developers to code and deploy AI applications, including language models, transformers, text-to-image models, and more. One of the stand-out features of the platform is 🤗 Datasets, a collection of over 5,000 ML datasets that are available for use. This guide walks through the process of configuring Hugging Face Datasets with Storj using s3fs, an interim approach until a Storj-native integration pattern is defined.
- Familiarity with Hugging Face and a Hugging Face account (see Quick Start Guide)
- A Storj S3-compatible access key and secret key
- A bucket created on Storj
Storj works with the Hugging Face APIs through s3fs. First, install the required dependencies: the s3fs and datasets Python packages.
Next, enter your Storj S3-compatible access key and secret key.
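The credentials step can be sketched as below. This is a minimal sketch under stated assumptions, not an official Storj snippet: the helper names and the environment variable names `STORJ_ACCESS_KEY` and `STORJ_SECRET_KEY` are hypothetical, and `https://gateway.storjshare.io` is assumed to be the hosted Storj S3 gateway endpoint, so substitute your own if it differs. It also assumes the dependencies are already installed, for example with `pip install -U s3fs datasets`.

```python
import os

# Assumption: Storj's hosted S3-compatible gateway; replace with your own endpoint.
STORJ_ENDPOINT = "https://gateway.storjshare.io"


def storj_storage_options(access_key, secret_key, endpoint_url=STORJ_ENDPOINT):
    """Build the storage_options dict passed to s3fs and Hugging Face Datasets."""
    return {
        "key": access_key,
        "secret": secret_key,
        "client_kwargs": {"endpoint_url": endpoint_url},
    }


def storage_options_from_env():
    """Read credentials from (hypothetical) environment variables
    rather than hard-coding them in a script."""
    return storj_storage_options(
        os.environ["STORJ_ACCESS_KEY"],
        os.environ["STORJ_SECRET_KEY"],
    )


def make_filesystem(storage_options):
    """Create an s3fs filesystem pointed at Storj (requires the s3fs package)."""
    import s3fs  # imported lazily so the helpers above work without s3fs installed

    return s3fs.S3FileSystem(**storage_options)
```

With the two variables exported, `storage_options = storage_options_from_env()` followed by `fs = make_filesystem(storage_options)` should give a filesystem object, and `fs.ls("my-dataset-bucket")` can serve as a quick check that the credentials work.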
Create a bucket for the dataset to be stored in. In this walk-through, the bucket is called my-dataset-bucket.
If your dataset is already on the Hugging Face Hub, you can use the load_dataset_builder function to download it and transfer it to Storj. The builder first downloads the raw dataset to your specified cache_dir, then prepares it and uploads it to Storj using the storage_options defined previously. Here we transfer the imdb dataset to Storj.
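One way the transfer could look is sketched below. The helper names `dataset_url` and `transfer_to_storj` are hypothetical, and writing Parquet files is an assumption made for illustration. The sketch relies on `load_dataset_builder` and `DatasetBuilder.download_and_prepare` from the datasets library, and on a release recent enough for `download_and_prepare` to accept `storage_options`.

```python
def dataset_url(bucket, name):
    """Return the s3:// URL under which the prepared dataset is stored."""
    return f"s3://{bucket}/{name}"


def transfer_to_storj(name, bucket, storage_options, cache_dir=None):
    """Download a Hub dataset and write the prepared files to a Storj bucket."""
    from datasets import load_dataset_builder  # requires the datasets package

    builder = load_dataset_builder(name, cache_dir=cache_dir)
    output_dir = dataset_url(bucket, name)
    # Raw files are downloaded to cache_dir first; the prepared files are
    # then written to output_dir through s3fs using storage_options.
    builder.download_and_prepare(
        output_dir,
        storage_options=storage_options,
        file_format="parquet",  # assumption: Parquet chosen for illustration
    )
    return output_dir
```

Calling `transfer_to_storj("imdb", "my-dataset-bucket", storage_options)` would then place the prepared files under `s3://my-dataset-bucket/imdb`.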
Once you've encoded a dataset, you can persist it using the save_to_disk method.
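A minimal sketch of persisting a dataset to Storj follows. The wrapper name `save_to_storj` and the example path are hypothetical. It assumes a datasets release in which `save_to_disk` accepts `storage_options`, and that `storage_options` holds the Storj credentials and endpoint described earlier in this guide.

```python
def save_to_storj(dataset, url, storage_options):
    """Persist a Dataset (or DatasetDict) to an s3:// URL on Storj."""
    # Assumption: this datasets version supports the storage_options argument.
    dataset.save_to_disk(url, storage_options=storage_options)
    return url
```

For example, `save_to_storj(encoded_dataset, "s3://my-dataset-bucket/imdb/train", storage_options)`, where `encoded_dataset` stands for whatever dataset object you have processed.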
Use the load_from_disk method to download your datasets.
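Loading a saved dataset back from Storj might look like the sketch below. The wrapper name `load_from_storj` and its `loader` parameter are hypothetical (the parameter exists only so the function can be exercised without network access). By default it calls `datasets.load_from_disk`, assuming a release that accepts `storage_options`.

```python
def load_from_storj(url, storage_options, loader=None):
    """Load a dataset previously saved with save_to_disk from an s3:// URL."""
    if loader is None:
        # Default path: use the real function from the datasets package.
        from datasets import load_from_disk

        loader = load_from_disk
    return loader(url, storage_options=storage_options)
```

For example, `dataset = load_from_storj("s3://my-dataset-bucket/imdb/train", storage_options)` should return the dataset that was saved at that path.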