Dataset Storage and R2 Utilities

This page documents how s6 stores dataset folders locally and in Cloudflare R2. A dataset is any folder that contains a data.jsonl file.

Storage model

  • Local datasets live under the root directory temp/ by default.

  • Remote datasets live under the prefix datasets/<name>/ by default.

  • The s6 dataset command works with dataset names.

  • The s6 r2 command works with object keys and prefixes directly.
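The defaults above can be sketched as a pair of tiny helpers. This is an illustrative sketch only; the function names (`local_path`, `remote_prefix`) and module constants are not taken from the s6 codebase.

```python
from pathlib import Path

# Defaults described above; illustrative, not the s6 implementation.
LOCAL_ROOT = Path("temp")
REMOTE_PREFIX = "datasets"

def local_path(name):
    """Local datasets live under the root directory (temp/ by default)."""
    return LOCAL_ROOT / name

def remote_prefix(name):
    """Remote datasets live under datasets/<name>/ by default."""
    return f"{REMOTE_PREFIX}/{name}/"
```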

Credentials and defaults

The R2 helpers use S3-compatible credentials and an explicit endpoint.

  • Access key: R2_ACCESS_KEY_ID or AWS_ACCESS_KEY_ID

  • Secret key: R2_SECRET_ACCESS_KEY or AWS_SECRET_ACCESS_KEY

  • Region: R2_REGION_NAME, AWS_REGION, or AWS_DEFAULT_REGION

  • Bucket default: assets

  • Endpoint default: https://1195172285921be7f47e85de5cc4a5ad.r2.cloudflarestorage.com

The command-line helpers also honor:

  • R2_BUCKET

  • R2_ENDPOINT or R2_ENDPOINT_URL
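The fallback order above amounts to checking each variable in turn and falling through to a default. A minimal sketch, assuming a generic `resolve` helper (the name is hypothetical, not from the s6 codebase):

```python
def resolve(env, *names, default=None):
    """Return the first non-empty variable in `names`, else the default."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return default

# Example environment: only the AWS-style key and the R2 region are set.
env = {"AWS_ACCESS_KEY_ID": "abc", "R2_REGION_NAME": "auto"}
access_key = resolve(env, "R2_ACCESS_KEY_ID", "AWS_ACCESS_KEY_ID")
region = resolve(env, "R2_REGION_NAME", "AWS_REGION", "AWS_DEFAULT_REGION")
bucket = resolve(env, "R2_BUCKET", default="assets")
```

In real use the lookups would read `os.environ` rather than a literal dict.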

s6 r2

s6 r2 is the lower-level object storage interface in src/s6/app/r2.

list

s6 r2 list [prefix] [--flat]
  • Lists objects under the optional prefix.

  • The default is recursive listing.

  • --flat switches to one-level listing and prints child prefixes first.
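The flat mode behaves like an S3 delimiter listing: keys one level below the prefix are split into child prefixes and direct objects. A sketch of that grouping, assuming an in-memory key list (`flat_list` is an illustrative name, not the s6 implementation):

```python
def flat_list(keys, prefix=""):
    """Group keys one level below `prefix`, S3-delimiter style.
    Returns (child_prefixes, objects)."""
    prefixes, objects = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if "/" in rest:
            # anything deeper than one level collapses into a child prefix
            prefixes.add(prefix + rest.split("/", 1)[0] + "/")
        else:
            objects.append(key)
    return sorted(prefixes), objects

children, objs = flat_list(
    ["datasets/a/data.jsonl", "datasets/b/data.jsonl", "datasets/README"],
    "datasets/",
)
# children == ["datasets/a/", "datasets/b/"]; objs == ["datasets/README"]
```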

download

s6 r2 download <object-or-prefix> [-o DIR] [-w N] [-p] [--overwrite]
  • A trailing / forces prefix download.

  • Otherwise the command first checks for an exact object key, then falls back to prefix download if children exist.

  • Exact object downloads preserve the full key path under the output directory.

  • Prefix downloads preserve the relative structure beneath the prefix.

  • The default output directory is temp/.

  • Without --overwrite, downloads fail fast if any target file already exists.
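The two path-preservation rules above can be captured in one function. This is a sketch under the stated rules; `target_path` is a hypothetical name, not the s6 code:

```python
from pathlib import Path

def target_path(key, out_dir="temp", prefix=None):
    """Exact-key downloads keep the full key path under out_dir;
    prefix downloads keep only the part of the key beneath the prefix."""
    if prefix is None:
        return Path(out_dir) / key
    return Path(out_dir) / key[len(prefix):]

# exact object: the full key path is preserved under temp/
assert target_path("datasets/run_net_01/data.jsonl").as_posix() \
    == "temp/datasets/run_net_01/data.jsonl"
# prefix download: only the structure beneath the prefix is preserved
assert target_path("datasets/run_net_01/data.jsonl",
                   prefix="datasets/run_net_01/").as_posix() == "temp/data.jsonl"
```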

upload

s6 r2 upload <local-path> <dest-key-or-prefix> [-w N] [-p] [--overwrite]
  • If the source is a directory, the tree is uploaded recursively under the destination prefix.

  • If the destination does not end with /, one is added for directory uploads.

  • If the source is a file and the destination ends with /, the basename is appended; otherwise the provided key is used exactly.

  • File uploads never overwrite existing objects.

  • --overwrite only applies to directory uploads.

  • Without --overwrite, directory uploads fail fast if any destination key already exists.
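The destination-key rules above (trailing slash for directories, basename appending for files into a prefix, exact keys otherwise) can be sketched as follows. `dest_key` is an illustrative helper, not the s6 implementation:

```python
from pathlib import Path

def dest_key(source, dest, is_dir):
    """Resolve the destination for an upload per the rules above."""
    if is_dir:
        # directory uploads always target a prefix
        return dest if dest.endswith("/") else dest + "/"
    if dest.endswith("/"):
        # file into a prefix: append the source basename
        return dest + Path(source).name
    return dest  # file to an exact key: use it verbatim

assert dest_key("./temp/run_net_01", "datasets/run_net_01", is_dir=True) \
    == "datasets/run_net_01/"
assert dest_key("notes.txt", "datasets/run_net_01/", is_dir=False) \
    == "datasets/run_net_01/notes.txt"
assert dest_key("notes.txt", "datasets/run_net_01/meta.txt", is_dir=False) \
    == "datasets/run_net_01/meta.txt"
```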

delete

s6 r2 delete <key-or-prefix> [-r|--recursive]
  • Deletes exactly one object unless --recursive is provided.

  • A trailing / also forces prefix deletion.

s6 dataset

s6 dataset is the named-dataset wrapper in src/s6/app/dataset.py.

list

s6 dataset list [--remote] [--local] [--local-only]
  • Lists dataset names, where a dataset is a folder containing data.jsonl.

  • Local datasets are scanned under --root and are shown by default.

  • --remote adds the remote list under --remote-prefix.

  • --local-only suppresses the remote list.
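The local scan reduces to finding folders one level below the root that contain data.jsonl. A minimal sketch using a throwaway directory (`scan_datasets` is a hypothetical name, not the s6 code):

```python
import tempfile
from pathlib import Path

def scan_datasets(root):
    """A dataset is any folder under root containing data.jsonl."""
    return sorted(p.parent.name for p in root.glob("*/data.jsonl"))

# build a throwaway root: one dataset folder, one plain folder
root = Path(tempfile.mkdtemp())
(root / "run_net_01").mkdir()
(root / "run_net_01" / "data.jsonl").write_text('{"frame": 0}\n')
(root / "scratch").mkdir()

print(scan_datasets(root))  # only run_net_01 qualifies
```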

upload

s6 dataset upload <name> [--overwrite]
  • Uploads --root/<name> to --remote-prefix/<name>/.

  • Uploads run in parallel, with progress output enabled by default.

  • Without --overwrite, the command fails before uploading if any remote destination key already exists.

download

s6 dataset download <name> [--overwrite]
  • Downloads --remote-prefix/<name>/ into --root/<name>.

  • Downloads run in parallel, with progress output enabled by default.

  • Without --overwrite, the command fails before downloading if any local target file already exists.

  • If the resulting folder does not contain data.jsonl, the command prints a warning.

delete

s6 dataset delete <name> [-y|--yes] [--local]
  • Deletes the remote dataset under --remote-prefix/<name>/.

  • --yes is required to confirm the deletion.

  • --local also removes the local dataset directory.

R2Client

src/s6/utils/r2.py provides the library layer used by the CLI commands.

Methods

  • list(prefix="", recursive=True)

  • download_file(key, local_path, progress=False)

  • download_directory(prefix, local_dir, max_workers=8, progress=False, overwrite=False)

  • upload_file(local_path, key, progress=False)

  • upload_directory(local_dir, prefix, overwrite=False, max_workers=8, progress=False)

  • upload_bytes(data, key)

  • delete_object(key, missing_ok=True)

Behavior

  • Listings paginate through ListObjectsV2.

  • Non-recursive listings return both objects and child prefixes.

  • Upload and download directory helpers use a thread pool for file-level parallelism.

  • upload_file and upload_bytes never overwrite.

  • upload_directory and download_directory preflight collisions when overwrite=False and raise FileExistsError before any transfer starts.

  • Progress output is printed to stderr when enabled.

  • The module guards optional boto3 imports so docs and tests can import it without cloud dependencies installed.
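The preflight behavior can be sketched as a pure check that runs before any transfer: when overwrite is off and any destination already exists, it raises FileExistsError immediately. `preflight_uploads` and its parameters are illustrative, not the R2Client API:

```python
def preflight_uploads(files, existing_keys, prefix, overwrite=False):
    """Fail fast: raise FileExistsError before any transfer starts
    if overwrite is False and any destination key already exists."""
    if overwrite:
        return
    collisions = [f for f in files if prefix + f in existing_keys]
    if collisions:
        raise FileExistsError(f"destination keys already exist: {collisions}")

# no collision: the check passes silently
preflight_uploads(["data.jsonl"], set(), "datasets/run_net_01/")

# collision without overwrite: the check raises before anything is sent
collided = False
try:
    preflight_uploads(["data.jsonl"], {"datasets/run_net_01/data.jsonl"},
                      "datasets/run_net_01/")
except FileExistsError:
    collided = True
```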

Example workflow

# Record a dataset locally
s6 track -i gst -o ./temp/run_net_01 -r -x

# Upload it to shared storage
s6 dataset upload run_net_01

# Download it on another machine
s6 dataset download run_net_01

# Replay the dataset
s6 track -i ./temp/run_net_01 --repeat -x

Troubleshooting

  • Missing credentials usually means R2_ACCESS_KEY_ID and R2_SECRET_ACCESS_KEY are unset.

  • If an upload or download fails immediately, check whether the destination already exists and whether --overwrite is needed.

  • For large dataset trees, increase -w/--workers to use more parallel file transfers.