Hugging Face
Hugging Face (🤗) is a platform that allows developers to train and deploy open-source AI models. It is similar to GitHub in that it provides a space for developers to code and deploy AI applications, including language models, transformers, text2image, and more. One of the stand-out features of the platform is “🤗 Datasets”, a collection of over 5,000 ML datasets that are available for use. In this guide, we will walk through the process of configuring Hugging Face Datasets with Storj using s3fs, until a Storj-native integration pattern is defined.
To follow along, you will need:
- Storj S3-compatible access key and secret key (see Storj with AWS SDK)
- A bucket created on Storj (see Create a Bucket)
This guide uses s3fs so that the Hugging Face APIs can work with Storj. First, install the required dependencies (the s3fs and datasets Python packages).
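As a quick sanity check for the install step, the sketch below reports which of the two packages are not yet importable. It assumes the dependencies are the `s3fs` and `datasets` packages from PyPI, which can typically be installed with `pip install -U s3fs datasets`; the helper name `missing_packages` is made up for this example.

```python
import importlib.util

# Assumption: these are the two packages this guide relies on. They can
# typically be installed with `pip install -U s3fs datasets`.
REQUIRED = ("s3fs", "datasets")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_packages()
if missing:
    print("Missing packages, install with: pip install -U " + " ".join(missing))
else:
    print("All dependencies are installed.")
```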
Next, enter your Storj S3-compatible access key and secret key (see Storj with AWS SDK).
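One way to supply the credentials is a `storage_options` dictionary in the form that s3fs accepts (`key`, `secret`, and `client_kwargs`). The sketch below is a minimal example under a few assumptions: the endpoint is Storj's hosted gateway at `https://gateway.storjshare.io`, and the environment variable names `STORJ_ACCESS_KEY` and `STORJ_SECRET_KEY` as well as the helper `make_storage_options` are hypothetical names chosen for illustration.

```python
import os

# Assumption: Storj's hosted S3-compatible gateway. Use your own endpoint
# if you run a self-hosted gateway.
STORJ_ENDPOINT = "https://gateway.storjshare.io"


def make_storage_options(access_key, secret_key, endpoint_url=STORJ_ENDPOINT):
    """Build the options dict that s3fs (via fsspec) understands."""
    return {
        "key": access_key,
        "secret": secret_key,
        "client_kwargs": {"endpoint_url": endpoint_url},
    }


# STORJ_ACCESS_KEY and STORJ_SECRET_KEY are hypothetical variable names;
# the placeholders are used only when the variables are not set.
storage_options = make_storage_options(
    os.environ.get("STORJ_ACCESS_KEY", "<access-key>"),
    os.environ.get("STORJ_SECRET_KEY", "<secret-key>"),
)
```

Reading the keys from the environment keeps them out of source control; the same dictionary can then be passed wherever a `storage_options` argument is expected.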
Create a bucket (see Create a Bucket) for the dataset to be stored in. In this walk-through, the bucket is called my-dataset-bucket.
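If you prefer to create the bucket from Python rather than through the steps in Create a Bucket, a minimal sketch could look like the following. The helper `ensure_bucket` is a hypothetical name; it accepts any fsspec-style filesystem object, so with s3fs you would pass an `s3fs.S3FileSystem` built from your credentials.

```python
def ensure_bucket(fs, bucket):
    """Create `bucket` through an fsspec-style filesystem if it is missing.

    Returns True when the bucket was created, False when it already existed.
    """
    if fs.exists(bucket):
        return False
    # With s3fs, a path that has no key part refers to the bucket itself.
    fs.mkdir(bucket)
    return True


# Usage sketch (needs s3fs installed and real credentials in a
# `storage_options` dict with key, secret, and client_kwargs):
#   import s3fs
#   fs = s3fs.S3FileSystem(**storage_options)
#   ensure_bucket(fs, "my-dataset-bucket")
```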
If your dataset is already on the Hugging Face Hub, you can use the load_dataset_builder function to download it and transfer it to Storj. It first downloads the raw dataset to your specified cache_dir, then prepares it and uploads it to Storj using the storage_options defined previously. Here we transfer the imdb dataset to Storj.
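The transfer described above can be sketched as follows. This assumes a version of `datasets` whose `download_and_prepare` accepts `storage_options`; the wrapper `transfer_to_storj`, the target path `s3://my-dataset-bucket/imdb`, and the choice of Parquet as the output format are illustrative assumptions, not requirements.

```python
def transfer_to_storj(dataset_name, output_dir, storage_options, cache_dir=None):
    """Download a Hub dataset and write the prepared files to `output_dir`."""
    # Imported here so this sketch can be loaded before `datasets` is installed.
    from datasets import load_dataset_builder

    # Raw files are downloaded into cache_dir (the library default if None).
    builder = load_dataset_builder(dataset_name, cache_dir=cache_dir)
    # The prepared dataset is written to the remote path through s3fs.
    # Assumption: Parquet is used as the output format for this example.
    builder.download_and_prepare(
        output_dir, storage_options=storage_options, file_format="parquet"
    )
    return output_dir


# Usage sketch (needs network access and real credentials):
#   transfer_to_storj("imdb", "s3://my-dataset-bucket/imdb", storage_options)
```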
Once you've encoded a dataset, you can persist it using the save_to_disk method.
You can download your datasets again with the load_from_disk method.
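The save and load steps can be sketched together as a round trip. This assumes a version of `datasets` in which `save_to_disk` and `load_from_disk` accept a `storage_options` argument; the wrapper names and the path `s3://my-dataset-bucket/imdb/train` are hypothetical and used only for illustration.

```python
def save_to_storj(dataset, path, storage_options):
    """Persist a processed Dataset (or DatasetDict) to a remote `path`."""
    dataset.save_to_disk(path, storage_options=storage_options)
    return path


def load_from_storj(path, storage_options):
    """Load a dataset that was previously written with save_to_disk."""
    # Imported here so this sketch can be loaded before `datasets` is installed.
    from datasets import load_from_disk

    return load_from_disk(path, storage_options=storage_options)


# Usage sketch (needs real credentials; `encoded_dataset` is your own
# processed dataset object):
#   path = "s3://my-dataset-bucket/imdb/train"
#   save_to_storj(encoded_dataset, path, storage_options)
#   dataset = load_from_storj(path, storage_options)
```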
