Hugging Face

Hugging Face (🤗) is a platform that lets developers train and deploy open-source AI models. It is similar to GitHub in providing a space for developers to build and deploy AI applications, including language models, transformers, text-to-image models, and more.

One of the platform's stand-out features is “🤗 Datasets”, a collection of over 5,000 ML datasets available for use.

In this guide, we will walk through configuring Hugging Face Datasets with Storj using s3fs until a Storj-native integration pattern is defined.

Prerequisites

Set up Storj with s3fs

Storj works with the Hugging Face Datasets API through s3fs.

First, install the required dependencies:

pip install -qqU s3fs datasets

Next, enter your Storj S3-compatible access key and secret key (see Getting started):

import s3fs
from getpass import getpass

key = getpass('Enter Storj access key')
secret = getpass('Enter Storj secret key')

storage_options = {"key": key, "secret": secret, "client_kwargs": {"endpoint_url": "https://gateway.storjshare.io"}}
fs = s3fs.S3FileSystem(**storage_options)
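
To verify the credentials, you can list the buckets visible to this access grant; a quick sanity check, relying on s3fs's convention that the empty root path lists buckets:

# list all buckets visible to this access grant
print(fs.ls(""))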

Create a bucket (see Create buckets) for the dataset to be stored in. In this walk-through, the bucket will be called my-dataset-bucket.
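
If you prefer to stay in Python, you can also create the bucket with the s3fs handle from the previous step; a minimal sketch, assuming your access grant permits bucket creation:

# at the top level, mkdir creates a bucket
if not fs.exists("my-dataset-bucket"):
    fs.mkdir("my-dataset-bucket")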

Transfer the existing Hugging Face dataset to Storj

If your dataset is already on the Hugging Face Hub, you can use the load_dataset_builder function to download and transfer it to Storj. It first downloads the raw dataset to the specified cache_dir, then prepares and uploads it to Storj using the storage_options defined previously.

Here we transfer the dataset imdb to Storj.

from datasets import load_dataset_builder

builder = load_dataset_builder("imdb")
output_dir = "s3://my-dataset-bucket/imdb"
builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
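
The parquet files written by download_and_prepare can also be read back directly. A sketch, assuming a recent version of datasets (one that accepts storage_options in load_dataset) and using the s3fs handle to discover the files:

from datasets import load_dataset

# find the parquet files written above and load them as a dataset
parquet_files = ["s3://" + path for path in fs.glob("my-dataset-bucket/imdb/**/*.parquet")]
dataset = load_dataset("parquet", data_files=parquet_files, storage_options=storage_options)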

Save the dataset to Storj

Once you've encoded a dataset, you can persist it to Storj using the save_to_disk method.
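
Here, encoded_dataset can come from any preprocessing step. A minimal sketch of producing one, assuming the imdb dataset and the bert-base-uncased tokenizer from the transformers library:

from datasets import load_dataset
from transformers import AutoTokenizer

# tokenize the raw imdb training split into an encoded dataset
dataset = load_dataset("imdb", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded_dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)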

encoded_dataset.save_to_disk("s3://my-dataset-bucket/imdb/train", storage_options=storage_options)

Load dataset from Storj

Use the load_from_disk method to download your dataset back from Storj.

from datasets import load_from_disk

# load encoded_dataset from cloud storage
dataset = load_from_disk("s3://my-dataset-bucket/imdb/train", storage_options=storage_options)
print(len(dataset))
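
To confirm what is stored on Storj, you can also list the bucket contents with the s3fs handle created earlier:

# list the dataset files stored under the train prefix
print(fs.ls("my-dataset-bucket/imdb/train"))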