# Dataset Storage and R2 Utilities
This document describes the dataset management workflow and the S3-compatible R2 utilities added to the S6 CLI. It covers the `s6 dataset` master command, the `s6 r2` helper commands, and the underlying `R2Client`, with API and behavior details.
## Concepts

- Dataset directory: a folder that contains a `data.jsonl` file. Examples live under `./temp`.
- Local root: by default all dataset operations target `./temp`.
- Remote layout: datasets are stored under the R2 prefix `datasets/<name>/` by default (configurable). Keys preserve directory structure on upload/download.
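To make the key mapping concrete, here is a minimal sketch under stated assumptions (the real logic lives in the s6 CLI helpers; the `frames/0001.png` path is a hypothetical example):

```python
from pathlib import Path

# Sketch of the local-to-remote key mapping described above; the actual
# implementation lives in the s6 CLI helpers. frames/0001.png is hypothetical.
def remote_key(dataset_dir: Path, remote_prefix: str, file_path: Path) -> str:
    # Preserve the path relative to the dataset directory under the prefix.
    return remote_prefix + file_path.relative_to(dataset_dir).as_posix()

key = remote_key(Path("temp/my_set"), "datasets/my_set/",
                 Path("temp/my_set/frames/0001.png"))
print(key)  # -> datasets/my_set/frames/0001.png
```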
## Motivation
Research and development loops produce sizable datasets (raw frames, previews, JSONL annotations, logs) that must be shared across local developer laptops, remote teammates, and lab machines. We wanted a single, predictable interface to move datasets without coupling the rest of the codebase to any vendor SDK or filesystem mount.
Cloudflare R2 is S3-compatible and economical (no egress fees), so we target a generic S3 API and keep the storage endpoint configurable. That gives us:

- A uniform path model (keys, prefixes) that maps cleanly to dataset folders.
- Credentials via environment variables to avoid hard-coding secrets.
- Swappability (R2 today, MinIO/AWS tomorrow) by only changing the endpoint.
- Clear safety semantics (no accidental overwrites) for collaborative work.

In short, the goal is to make "record → upload → iterate → download" as simple and reliable as copying a folder locally, but robust across machines.
## Credentials and Defaults
R2 is S3-compatible and uses standard S3 credentials. The client resolves credentials/region from environment variables when not explicitly provided.
- Access key: `R2_ACCESS_KEY_ID` or `AWS_ACCESS_KEY_ID`
- Secret key: `R2_SECRET_ACCESS_KEY` or `AWS_SECRET_ACCESS_KEY`
- Region: `R2_REGION_NAME`, `AWS_REGION`, or `AWS_DEFAULT_REGION` (optional)
- Bucket default: `assets` (override with `-b/--bucket` or `R2_BUCKET`)
- Endpoint default: **** (override with `-e/--endpoint` or `R2_ENDPOINT`/`R2_ENDPOINT_URL`)
All CLI commands accept flags to override bucket/endpoint/region and also honor the environment variables listed above.
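As an illustration, here is a minimal sketch of the resolution order (assuming `R2_*` takes precedence over `AWS_*`, as the listing above suggests; the authoritative logic lives in `R2Client`):

```python
import os

# Assumed precedence: R2_* first, then AWS_*; see R2Client for the real logic.
def first_env(*names: str) -> str | None:
    for name in names:
        value = os.environ.get(name)
        if value:
            return value
    return None

access_key = first_env("R2_ACCESS_KEY_ID", "AWS_ACCESS_KEY_ID")
secret_key = first_env("R2_SECRET_ACCESS_KEY", "AWS_SECRET_ACCESS_KEY")
region = first_env("R2_REGION_NAME", "AWS_REGION", "AWS_DEFAULT_REGION")
```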
## Setup

Follow these steps to obtain and configure R2 credentials with the standard environment variables we use (`R2_ACCESS_KEY_ID` and `R2_SECRET_ACCESS_KEY`).
**1. Create an Access Key in Cloudflare R2**

- Sign in to the Cloudflare dashboard.
- Navigate to R2 → S3 API → Create Access Key (or Manage R2 → Create access key).
- Choose permissions (typically Object Read/Write) and optionally restrict the key to specific buckets.
- Copy the generated Access Key ID and Secret Access Key. You will not be able to view the secret again after closing the dialog.

**2. Find your Account ID and Endpoint URL**

- In the R2 S3 API section, note your Account ID. The endpoint URL is `https://<ACCOUNT_ID>.r2.cloudflarestorage.com`.
- Example used in this repo's defaults: `https://1195172285921be7f47e85de5cc4a5ad.r2.cloudflarestorage.com`
**3. Export credentials and endpoint in your shell**
```bash
# Required credentials
export R2_ACCESS_KEY_ID="<YOUR_ACCESS_KEY_ID>"
export R2_SECRET_ACCESS_KEY="<YOUR_SECRET_ACCESS_KEY>"

# Optional: set endpoint and bucket (overrides CLI defaults)
export R2_ENDPOINT="https://<ACCOUNT_ID>.r2.cloudflarestorage.com"
export R2_BUCKET="assets"

# Optional: set region if needed; R2 typically works with "auto"
export R2_REGION_NAME="auto"
```
To make these persistent, add the exports to your shell profile (e.g., `~/.zshrc` or `~/.bashrc`) and reload your shell.
**4. Validate the setup**
```bash
# List at root (uses defaults if set); add -b/-e to override
s6 r2 list -b "${R2_BUCKET:-assets}" -e "${R2_ENDPOINT}"

# Or list datasets one level deep
s6 r2 list datasets/ --flat -b "${R2_BUCKET:-assets}" -e "${R2_ENDPOINT}"
```
If these commands return object keys or prefixes (instead of auth errors), your credentials and endpoint are configured correctly.
## CLI: `s6 r2` Utilities

Location: `src/s6/app/r2`

Shared flags (on all subcommands):

- `-b, --bucket`: bucket name (default `assets`)
- `-e, --endpoint`: endpoint URL (defaults to the endpoint above)
- `--region`: explicit region override (optional)
### List

Command: `s6 r2 list [prefix] [--flat]`

- Lists objects under an optional prefix. By default, lists recursively.
- `--flat` performs a one-level list and also prints child prefixes.

Examples:

```bash
s6 r2 list
s6 r2 list datasets/ --flat
s6 r2 list datasets/my_set/
```
### Download

Command: `s6 r2 download <object-or-prefix> [-o DIR] [-w N] [-p] [--overwrite]`

- If the argument ends with `/`, it is a prefix download; otherwise the command auto-detects an exact object vs. a prefix fallback.
- `-o, --output` sets the local destination directory (default `temp`).
- `-w, --workers` controls parallelism for prefix downloads (default `8`).
- `-p, --progress` shows a running progress counter.
- `--overwrite` allows replacing existing local files. Without it, the command aborts early if any target path already exists (fail-fast safety).

Examples:

```bash
s6 r2 download datasets/my_set/ -o temp -w 12 -p
s6 r2 download datasets/my_set/data.jsonl -o temp -p
```
### Upload

Command: `s6 r2 upload <local-path> <dest-key-or-prefix> [-w N] [-p] [--overwrite]`

- If `<local-path>` is a directory, uploads recursively under the destination prefix; if the destination does not end with `/`, one is appended.
- If `<local-path>` is a file and the destination ends with `/`, the basename is appended; otherwise it uploads to the provided key.
- `-w, --workers` controls parallelism for directory uploads (default `8`).
- `-p, --progress` enables progress output.
- Without `--overwrite`, upload is fail-fast if any destination keys already exist (preflight check via prefix listing).

Examples:

```bash
s6 r2 upload ./temp/diverse_2 datasets/diverse_2/ -w 12 -p
s6 r2 upload ./temp/diverse_2/data.jsonl datasets/diverse_2/data.jsonl -p
```
### Delete

Command: `s6 r2 delete <key-or-prefix> [-r|--recursive]`

- Deletes exactly one object unless `--recursive` is provided or the argument ends with `/`, in which case it deletes all objects under the prefix.

Examples:

```bash
s6 r2 delete datasets/diverse_2/data.jsonl
s6 r2 delete datasets/diverse_2/ -r
```
## CLI: `s6 dataset` Master Command

Location: `src/s6/app/dataset.py`

Global flags (apply to all subcommands):

- `--root`: local root for dataset directories (default `temp`)
- `--remote-prefix`: remote base prefix (default `datasets/`)
- `-w, --workers`: parallel workers for upload/download (default `8`)
- R2 flags: `-b/--bucket`, `-e/--endpoint`, `--region` (same defaults as above)
Subcommands:
### List

`s6 dataset list [--remote] [--local] [--local-only]`

- Lists dataset names (folders with `data.jsonl`).
- Local: scans `--root` (default `temp`).
- Remote: lists one-level prefixes under `--remote-prefix` (default `datasets/`).
- If no `--local`/`--remote` flags are passed, local is shown by default.
### Upload

`s6 dataset upload <name> [--overwrite]`

- Uploads local dataset `--root/<name>` to remote `--remote-prefix/<name>/`.
- Uses parallel uploads with progress enabled by default.
- Without `--overwrite`, performs a fail-fast preflight: if any destination key already exists, it aborts before uploading.
### Download

`s6 dataset download <name> [--overwrite]`

- Downloads remote dataset `--remote-prefix/<name>/` into local `--root/<name>`.
- Uses parallel downloads with progress enabled by default.
- Without `--overwrite`, aborts early if any local target files already exist.
- Warns if the downloaded directory does not contain `data.jsonl`.
### Delete

`s6 dataset delete <name> [-y|--yes] [--local]`

- Deletes the remote dataset at `--remote-prefix/<name>/` (requires `--yes`).
- With `--local`, also removes the local dataset directory.
## R2Client API

Location: `src/s6/utils/r2.py`
### Design Overview

The storage client and the CLI are layered to keep responsibilities clear:

R2Client (library)

- Thin façade over the `boto3` S3 client, bound to a specific bucket/endpoint.
- Resolves credentials/region from arguments or the environment (`R2_*` or `AWS_*`).
- Implements the minimal primitives we rely on: list, upload, download, delete.
- Listings paginate and optionally use a delimiter for one-level views.
- Upload helpers refuse to overwrite by default; directory uploads allow an explicit `overwrite=True`.
- Existence is checked via `HEAD` to avoid listing entire buckets (see the sketch below).
- Imports are guarded so documentation builds don't require `boto3`.
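A rough sketch of what the `HEAD`-based existence check looks like with raw `boto3` (an assumed shape, not the exact implementation in `s6.utils.r2`):

```python
import boto3
from botocore.exceptions import ClientError

# Assumed shape of the HEAD-based existence check; not the exact implementation.
s3 = boto3.client("s3", endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

def object_exists(bucket: str, key: str) -> bool:
    try:
        s3.head_object(Bucket=bucket, Key=key)  # HEAD: no object body transferred
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise  # propagate auth/connectivity errors instead of masking them
```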
CLI utilities (`s6 r2` / `s6 dataset`)

- Compose `R2Client` primitives into higher-level operations with progress and concurrency controls.
- Map local directories to object keys by preserving relative paths under a chosen prefix (`datasets/<name>/`).
- Perform preflight safety checks to achieve fail-fast, all-or-nothing behavior in collaborative environments.

See the module docs in `s6.utils.r2` for API-level details and examples.
### Construction

```python
R2Client(
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    bucket_name: str = "",
    endpoint_url: str = "",
    region_name: str = "auto",
    session: Optional[boto3.session.Session] = None,
)
```

- Credentials/region are read from the environment when arguments are `None`.
- `region_name="auto"` uses `R2_REGION_NAME`/`AWS_REGION`/`AWS_DEFAULT_REGION` when present.
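Typical construction, assuming credentials are exported as shown in Setup (fill in the endpoint placeholder for your account):

```python
from s6.utils.r2 import R2Client

# Credentials and region come from the R2_*/AWS_* environment variables.
client = R2Client(
    bucket_name="assets",
    endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com",
)
```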
### Listing

```python
list(prefix: str = "", recursive: bool = True) -> tuple[list[R2Object], list[str]]
```

- Returns `(objects, prefixes)`; `prefixes` is populated only when `recursive=False` (one-level list).

`R2Object` fields:

- `key: str`
- `size: int`
- `last_modified: Any`
- `etag: str`
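For example, a one-level listing mirroring `s6 r2 list datasets/ --flat`:

```python
from s6.utils.r2 import R2Client

client = R2Client(bucket_name="assets",
                  endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

# One-level view: objects directly under datasets/ plus child dataset prefixes.
objects, prefixes = client.list("datasets/", recursive=False)
for prefix in prefixes:
    print("dataset:", prefix)           # e.g. datasets/diverse_2/
for obj in objects:
    print(obj.key, obj.size, obj.etag)  # R2Object fields listed above
```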
### Download

```python
download_file(key: str, local_path: str, *, progress: bool = False) -> None

download_directory(
    prefix: str,
    local_dir: str,
    *,
    max_workers: int = 8,
    progress: bool = False,
    overwrite: bool = False,
) -> None
```

- Directory downloads are parallelized with a thread pool and show aggregate progress when `progress=True`.
- When `overwrite=False`, a preflight check ensures none of the target files already exist; otherwise it raises `FileExistsError` (fail-fast).
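Example usage based on the signatures above (paths are illustrative); note the fail-fast behavior when local files already exist:

```python
from s6.utils.r2 import R2Client

client = R2Client(bucket_name="assets",
                  endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

# Single object, with progress printed to stderr.
client.download_file("datasets/my_set/data.jsonl",
                     "temp/my_set/data.jsonl", progress=True)

# Whole prefix; refuses to start if any local target already exists.
try:
    client.download_directory("datasets/my_set/", "temp/my_set",
                              max_workers=12, progress=True)
except FileExistsError as err:
    print(f"aborted before any transfer: {err}")
```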
### Upload

```python
upload_file(local_path: str, key: str, *, progress: bool = False) -> None

upload_directory(
    local_dir: str,
    prefix: str,
    *,
    overwrite: bool = False,
    max_workers: int = 8,
    progress: bool = False,
) -> None

upload_bytes(data: bytes, key: str) -> None
```

- Directory uploads are parallelized; with `progress=True`, bytes across all files are aggregated and reported.
- When `overwrite=False`, a preflight uses a single remote listing to detect any destination collisions and aborts early with `FileExistsError`.
- `upload_file` and `upload_bytes` never overwrite; delete first to replace.
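Example usage based on the signatures above (the `meta.json` key is a hypothetical illustration):

```python
from s6.utils.r2 import R2Client

client = R2Client(bucket_name="assets",
                  endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

# Directory upload; the preflight listing aborts on any destination collision.
client.upload_directory("temp/diverse_2", "datasets/diverse_2/",
                        max_workers=12, progress=True)

# Small in-memory artifact; upload_bytes never overwrites an existing key.
client.upload_bytes(b'{"note": "example"}\n', "datasets/diverse_2/meta.json")
```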
### Delete

```python
delete_object(key: str, missing_ok: bool = True) -> None
```

- Deletes an individual object. With `missing_ok=False`, raises if the object is not found.
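For example (keys are illustrative):

```python
from s6.utils.r2 import R2Client

client = R2Client(bucket_name="assets",
                  endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

# Default: deleting an absent key is a no-op.
client.delete_object("datasets/diverse_2/data.jsonl")

# Strict variant: raise if the object does not exist.
client.delete_object("datasets/diverse_2/old.jsonl", missing_ok=False)
```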
### Concurrency and Progress

- Per-transfer concurrency is configured via `boto3.s3.transfer.TransferConfig` when available; otherwise defaults are used.
- Directory-level concurrency is controlled by `max_workers` on the client methods and CLI flags.
- Progress is printed to stderr at ~5 Hz in the form `Uploading: X/Y bytes (Z%)` or `Downloading: ...`, and concludes with a newline.
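A sketch of how per-transfer concurrency is typically wired with `boto3` (values are illustrative; the client's actual settings may differ):

```python
from boto3.s3.transfer import TransferConfig

# Illustrative values; the client's real configuration may differ.
config = TransferConfig(
    max_concurrency=8,                    # threads used within one transfer
    multipart_threshold=8 * 1024 * 1024,  # switch to multipart above 8 MiB
)
# boto3's S3 client accepts this per call, e.g.:
#   s3.upload_file(local_path, bucket, key, Config=config)
```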
### Safety Semantics

- Fail-fast no-overwrite: both `upload_directory` and `download_directory` perform preflight checks when `overwrite=False` and abort before starting any transfers if conflicts are found.
- Prefix/object detection: the download CLI treats a trailing `/` as a prefix; otherwise it tries the exact key first and falls back to a prefix if appropriate.
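The preflight pattern, as a minimal sketch (an assumed shape: one listing of the destination prefix, intersected with the keys about to be written; keys are hypothetical):

```python
from s6.utils.r2 import R2Client

client = R2Client(bucket_name="assets",
                  endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

# Assumed shape of the fail-fast preflight, not the exact implementation:
# one listing up front, abort before any transfer starts if there is a conflict.
objects, _ = client.list("datasets/my_set/")
existing = {obj.key for obj in objects}
planned = {"datasets/my_set/data.jsonl"}  # hypothetical keys about to be written
conflicts = planned & existing
if conflicts:
    raise FileExistsError(f"refusing to overwrite: {sorted(conflicts)}")
```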
## Examples

### Multi-User Workflow: Record → Upload → Replay
This workflow shows how one teammate records a dataset and another replays it remotely using the shared R2 storage.
On Machine A (Recorder):
```bash
# 1) Record from live input into a local dataset directory
s6 track -i network -o ./temp/run_net_01 -r -x

# 2) Upload the dataset to shared storage
s6 dataset upload run_net_01
# or equivalently, using r2
s6 r2 upload ./temp/run_net_01 datasets/run_net_01/ -w 12 -p
```
On Machine B (Consumer):
```bash
# 3) Download the dataset locally
s6 dataset download run_net_01
# or with r2
s6 r2 download datasets/run_net_01/ -o ./temp -w 12 -p

# 4) Replay the dataset for development and testing
s6 track -i ./temp/run_net_01 --repeat -x
```
Tips:
- Use `--config <file.yaml>` with `s6 track` to test different pipeline configurations against the same dataset.
- Keep `--repeat` on during development for quick iterative cycles.
- Profiling with `-x` writes Chrome trace logs; see `docs/recipes/pipeline_chrome_trace.md`.
### End-to-end roundtrip for a dataset named `diverse_2`
```bash
# Upload with defaults (parallel + progress)
s6 dataset upload diverse_2

# On another machine, download it
s6 dataset download diverse_2

# Inspect locally and remotely
s6 dataset list
s6 dataset list --remote

# Clean up remotely
s6 dataset delete diverse_2 --yes
```
Direct R2 usage:
```bash
s6 r2 list datasets/ --flat
s6 r2 upload ./temp/diverse_2 datasets/diverse_2/ -w 12 -p
s6 r2 download datasets/diverse_2/ -o temp -w 12 -p
s6 r2 delete datasets/diverse_2/ -r
```
## Troubleshooting

- Missing credentials: set `R2_ACCESS_KEY_ID`/`R2_SECRET_ACCESS_KEY` (or their `AWS_*` equivalents) and ensure your endpoint URL is correct for your R2 account.
- Partial results: operations should be all-or-nothing when `overwrite=False`. If you need to replace files, pass `--overwrite` (supported by both the download and upload commands). For single-file uploads, delete the existing key first.
- Performance: increase `-w`/`--workers` for more parallelism. Very large datasets benefit from higher worker counts; monitor your network and any rate limits.