# Dataset Storage and R2 Utilities
This document describes the dataset management workflow and the S3-compatible R2 utilities added to the S6 CLI. It covers the `s6 dataset` master command, the `s6 r2` helper commands, and the underlying `R2Client`, with API and behavior details.
## Concepts

- Dataset directory: a folder that contains a `data.jsonl` file. Examples live under `./temp`.
- Local root: by default all dataset operations target `./temp`.
- Remote layout: datasets are stored under the R2 prefix `datasets/<name>/` by default (configurable). Keys preserve directory structure on upload/download.
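To make the key mapping concrete, here is a minimal sketch under stated assumptions (the real logic lives in the s6 CLI helpers; the `frames/0001.png` path is a hypothetical example):

```python
from pathlib import Path

# Sketch of the local-to-remote key mapping described above; the actual
# implementation lives in the s6 CLI helpers. frames/0001.png is hypothetical.
def remote_key(dataset_dir: Path, remote_prefix: str, file_path: Path) -> str:
    # Preserve the path relative to the dataset directory under the prefix.
    return remote_prefix + file_path.relative_to(dataset_dir).as_posix()

key = remote_key(Path("temp/my_set"), "datasets/my_set/",
                 Path("temp/my_set/frames/0001.png"))
print(key)  # -> datasets/my_set/frames/0001.png
```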
## Motivation
Research and development loops produce sizable datasets (raw frames, previews, JSONL annotations, logs) that must be shared across local developer laptops, remote teammates, and lab machines. We wanted a single, predictable interface to move datasets without coupling the rest of the codebase to any vendor SDK or filesystem mount.
Cloudflare R2 is S3-compatible and economical (no egress fees), so we target a generic S3 API and keep the storage endpoint configurable. That gives us:

- A uniform path model (keys, prefixes) that maps cleanly to dataset folders.
- Credentials via environment variables to avoid hard-coding secrets.
- Swappability (R2 today, MinIO/AWS tomorrow) by only changing the endpoint.
- Clear safety semantics (no accidental overwrites) for collaborative work.

In short, the goal is to make "record → upload → iterate → download" as simple and reliable as copying a folder locally, but robust across machines.
## Credentials and Defaults
R2 is S3-compatible and uses standard S3 credentials. The client resolves credentials/region from environment variables when not explicitly provided.
- Access key: `R2_ACCESS_KEY_ID` or `AWS_ACCESS_KEY_ID`
- Secret key: `R2_SECRET_ACCESS_KEY` or `AWS_SECRET_ACCESS_KEY`
- Region: `R2_REGION_NAME`, `AWS_REGION`, or `AWS_DEFAULT_REGION` (optional)
- Bucket default: `assets` (override with `-b/--bucket` or `R2_BUCKET`)
- Endpoint default: **** (override with `-e/--endpoint` or `R2_ENDPOINT`/`R2_ENDPOINT_URL`)
All CLI commands accept flags to override bucket/endpoint/region and also honor the environment variables listed above.
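As an illustration, here is a minimal sketch of the resolution order (assuming `R2_*` takes precedence over `AWS_*`, as the listing above suggests; the authoritative logic lives in `R2Client`):

```python
import os

# Assumed precedence: R2_* first, then AWS_*; see R2Client for the real logic.
def first_env(*names: str) -> str | None:
    for name in names:
        value = os.environ.get(name)
        if value:
            return value
    return None

access_key = first_env("R2_ACCESS_KEY_ID", "AWS_ACCESS_KEY_ID")
secret_key = first_env("R2_SECRET_ACCESS_KEY", "AWS_SECRET_ACCESS_KEY")
region = first_env("R2_REGION_NAME", "AWS_REGION", "AWS_DEFAULT_REGION")
```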
## Setup

Follow these steps to obtain and configure R2 credentials with the standard environment variables we use (`R2_ACCESS_KEY_ID` and `R2_SECRET_ACCESS_KEY`).
**1. Create an Access Key in Cloudflare R2**

- Sign in to the Cloudflare dashboard.
- Navigate to R2 → S3 API → Create Access Key (or Manage R2 → Create access key).
- Choose permissions (typically Object Read/Write) and optionally restrict the key to specific buckets.
- Copy the generated Access Key ID and Secret Access Key. You will not be able to view the secret again after closing the dialog.

**2. Find your Account ID and Endpoint URL**

- In the R2 S3 API section, note your Account ID. The endpoint URL is `https://<ACCOUNT_ID>.r2.cloudflarestorage.com`.
- Example used in this repo's defaults: `https://1195172285921be7f47e85de5cc4a5ad.r2.cloudflarestorage.com`
**3. Export credentials and endpoint in your shell**
```bash
# Required credentials
export R2_ACCESS_KEY_ID="<YOUR_ACCESS_KEY_ID>"
export R2_SECRET_ACCESS_KEY="<YOUR_SECRET_ACCESS_KEY>"

# Optional: set endpoint and bucket (overrides CLI defaults)
export R2_ENDPOINT="https://<ACCOUNT_ID>.r2.cloudflarestorage.com"
export R2_BUCKET="assets"

# Optional: set region if needed; R2 typically works with "auto"
export R2_REGION_NAME="auto"
```
To make these persistent, add the exports to your shell profile (e.g., `~/.zshrc` or `~/.bashrc`) and reload your shell.
**4. Validate the setup**
```bash
# List at root (uses defaults if set); add -b/-e to override
s6 r2 list -b "${R2_BUCKET:-assets}" -e "${R2_ENDPOINT}"

# Or list datasets one level deep
s6 r2 list datasets/ --flat -b "${R2_BUCKET:-assets}" -e "${R2_ENDPOINT}"
```
If these commands return object keys or prefixes (instead of auth errors), your credentials and endpoint are configured correctly.
## CLI: `s6 r2` Utilities

Location: `src/s6/app/r2`

Shared flags (on all subcommands):

- `-b, --bucket`: bucket name (default `assets`)
- `-e, --endpoint`: endpoint URL (defaults to the endpoint above)
- `--region`: explicit region override (optional)
### List

Command: `s6 r2 list [prefix] [--flat]`

- Lists objects under an optional prefix. By default, lists recursively.
- `--flat` performs a one-level list and also prints child prefixes.

Examples:

```bash
s6 r2 list
s6 r2 list datasets/ --flat
s6 r2 list datasets/my_set/
```
### Download

Command: `s6 r2 download <object-or-prefix> [-o DIR] [-w N] [-p] [--overwrite]`

- If the argument ends with `/`, it is a prefix download; otherwise the command auto-detects an exact object vs. a prefix fallback.
- `-o, --output` sets the local destination directory (default `temp`).
- `-w, --workers` controls parallelism for prefix downloads (default `8`).
- `-p, --progress` shows a running progress counter.
- `--overwrite` allows replacing existing local files. Without it, the command aborts early if any target path already exists (fail-fast safety).

Examples:

```bash
s6 r2 download datasets/my_set/ -o temp -w 12 -p
s6 r2 download datasets/my_set/data.jsonl -o temp -p
```
### Upload

Command: `s6 r2 upload <local-path> <dest-key-or-prefix> [-w N] [-p] [--overwrite]`

- If `<local-path>` is a directory, uploads recursively under the destination prefix; if the destination does not end with `/`, one is appended.
- If `<local-path>` is a file and the destination ends with `/`, the basename is appended; otherwise it uploads to the provided key.
- `-w, --workers` controls parallelism for directory uploads (default `8`).
- `-p, --progress` enables progress output.
- Without `--overwrite`, upload is fail-fast if any destination keys already exist (preflight check via prefix listing).

Examples:

```bash
s6 r2 upload ./temp/diverse_2 datasets/diverse_2/ -w 12 -p
s6 r2 upload ./temp/diverse_2/data.jsonl datasets/diverse_2/data.jsonl -p
```
### Delete

Command: `s6 r2 delete <key-or-prefix> [-r|--recursive]`

- Deletes exactly one object unless `--recursive` is provided or the argument ends with `/`, in which case it deletes all objects under the prefix.

Examples:

```bash
s6 r2 delete datasets/diverse_2/data.jsonl
s6 r2 delete datasets/diverse_2/ -r
```
## CLI: `s6 dataset` Master Command

Location: `src/s6/app/dataset.py`

Global flags (apply to all subcommands):

- `--root`: local root for dataset directories (default `temp`)
- `--remote-prefix`: remote base prefix (default `datasets/`)
- `-w, --workers`: parallel workers for upload/download (default `8`)
- R2 flags: `-b/--bucket`, `-e/--endpoint`, `--region` (same defaults as above)
Subcommands:
### List

`s6 dataset list [--remote] [--local] [--local-only]`

- Lists dataset names (folders with `data.jsonl`).
- Local: scans `--root` (default `temp`).
- Remote: lists one-level prefixes under `--remote-prefix` (default `datasets/`).
- If no `--local`/`--remote` flags are passed, local is shown by default.
### Upload

`s6 dataset upload <name> [--overwrite]`

- Uploads local dataset `--root/<name>` to remote `--remote-prefix/<name>/`.
- Uses parallel uploads with progress enabled by default.
- Without `--overwrite`, performs a fail-fast preflight: if any destination key already exists, it aborts before uploading.
### Download

`s6 dataset download <name> [--overwrite]`

- Downloads remote dataset `--remote-prefix/<name>/` into local `--root/<name>`.
- Uses parallel downloads with progress enabled by default.
- Without `--overwrite`, aborts early if any local target files already exist.
- Warns if the downloaded directory does not contain `data.jsonl`.
### Delete

`s6 dataset delete <name> [-y|--yes] [--local]`

- Deletes the remote dataset at `--remote-prefix/<name>/` (requires `--yes`).
- With `--local`, also removes the local dataset directory.
## R2Client API

Location: `src/s6/utils/r2.py`
### Design Overview

The storage client and the CLI are layered to keep responsibilities clear:

R2Client (library)

- Thin façade over the `boto3` S3 client, bound to a specific bucket/endpoint.
- Resolves credentials/region from arguments or the environment (`R2_*` or `AWS_*`).
- Implements the minimal primitives we rely on: list, upload, download, delete.
- Listings paginate and optionally use a delimiter for one-level views.
- Upload helpers refuse to overwrite by default; directory uploads allow an explicit `overwrite=True`.
- Existence is checked via `HEAD` to avoid listing entire buckets (see the sketch below).
- Imports are guarded so documentation builds don't require `boto3`.
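A rough sketch of what the `HEAD`-based existence check looks like with raw `boto3` (an assumed shape, not the exact implementation in `s6.utils.r2`):

```python
import boto3
from botocore.exceptions import ClientError

# Assumed shape of the HEAD-based existence check; not the exact implementation.
s3 = boto3.client("s3", endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

def object_exists(bucket: str, key: str) -> bool:
    try:
        s3.head_object(Bucket=bucket, Key=key)  # HEAD: no object body transferred
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise  # propagate auth/connectivity errors instead of masking them
```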
CLI utilities (`s6 r2` / `s6 dataset`)

- Compose `R2Client` primitives into higher-level operations with progress and concurrency controls.
- Map local directories to object keys by preserving relative paths under a chosen prefix (`datasets/<name>/`).
- Perform preflight safety checks to achieve fail-fast, all-or-nothing behavior in collaborative environments.

See the module docs in `s6.utils.r2` for API-level details and examples.
### Construction

```python
R2Client(
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    bucket_name: str = "",
    endpoint_url: str = "",
    region_name: str = "auto",
    session: Optional[boto3.session.Session] = None,
)
```

- Credentials/region are read from the environment when arguments are `None`.
- `region_name="auto"` uses `R2_REGION_NAME`/`AWS_REGION`/`AWS_DEFAULT_REGION` when present.
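Typical construction, assuming credentials are exported as shown in Setup (fill in the endpoint placeholder for your account):

```python
from s6.utils.r2 import R2Client

# Credentials and region come from the R2_*/AWS_* environment variables.
client = R2Client(
    bucket_name="assets",
    endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com",
)
```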
### Listing

```python
list(prefix: str = "", recursive: bool = True) -> tuple[list[R2Object], list[str]]
```

- Returns `(objects, prefixes)`; `prefixes` is populated only when `recursive=False` (one-level list).

`R2Object` fields:

- `key: str`
- `size: int`
- `last_modified: Any`
- `etag: str`
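For example, a one-level listing mirroring `s6 r2 list datasets/ --flat`:

```python
from s6.utils.r2 import R2Client

client = R2Client(bucket_name="assets",
                  endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

# One-level view: objects directly under datasets/ plus child dataset prefixes.
objects, prefixes = client.list("datasets/", recursive=False)
for prefix in prefixes:
    print("dataset:", prefix)           # e.g. datasets/diverse_2/
for obj in objects:
    print(obj.key, obj.size, obj.etag)  # R2Object fields listed above
```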
### Download

```python
download_file(key: str, local_path: str, *, progress: bool = False) -> None

download_directory(
    prefix: str,
    local_dir: str,
    *,
    max_workers: int = 8,
    progress: bool = False,
    overwrite: bool = False,
) -> None
```

- Directory downloads are parallelized with a thread pool and show aggregate progress when `progress=True`.
- When `overwrite=False`, a preflight check ensures none of the target files already exist; otherwise it raises `FileExistsError` (fail-fast).
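Example usage based on the signatures above (paths are illustrative); note the fail-fast behavior when local files already exist:

```python
from s6.utils.r2 import R2Client

client = R2Client(bucket_name="assets",
                  endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

# Single object, with progress printed to stderr.
client.download_file("datasets/my_set/data.jsonl",
                     "temp/my_set/data.jsonl", progress=True)

# Whole prefix; refuses to start if any local target already exists.
try:
    client.download_directory("datasets/my_set/", "temp/my_set",
                              max_workers=12, progress=True)
except FileExistsError as err:
    print(f"aborted before any transfer: {err}")
```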
### Upload

```python
upload_file(local_path: str, key: str, *, progress: bool = False) -> None

upload_directory(
    local_dir: str,
    prefix: str,
    *,
    overwrite: bool = False,
    max_workers: int = 8,
    progress: bool = False,
) -> None

upload_bytes(data: bytes, key: str) -> None
```

- Directory uploads are parallelized; with `progress=True`, bytes across all files are aggregated and reported.
- When `overwrite=False`, a preflight uses a single remote listing to detect any destination collisions and aborts early with `FileExistsError`.
- `upload_file` and `upload_bytes` never overwrite; delete first to replace.
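Example usage based on the signatures above (the `meta.json` key is a hypothetical illustration):

```python
from s6.utils.r2 import R2Client

client = R2Client(bucket_name="assets",
                  endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

# Directory upload; the preflight listing aborts on any destination collision.
client.upload_directory("temp/diverse_2", "datasets/diverse_2/",
                        max_workers=12, progress=True)

# Small in-memory artifact; upload_bytes never overwrites an existing key.
client.upload_bytes(b'{"note": "example"}\n', "datasets/diverse_2/meta.json")
```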
### Delete

```python
delete_object(key: str, missing_ok: bool = True) -> None
```

- Deletes an individual object. With `missing_ok=False`, raises if the object is not found.
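For example (keys are illustrative):

```python
from s6.utils.r2 import R2Client

client = R2Client(bucket_name="assets",
                  endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

# Default: deleting an absent key is a no-op.
client.delete_object("datasets/diverse_2/data.jsonl")

# Strict variant: raise if the object does not exist.
client.delete_object("datasets/diverse_2/old.jsonl", missing_ok=False)
```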
### Concurrency and Progress

- Per-transfer concurrency is configured via `boto3.s3.transfer.TransferConfig` when available; otherwise defaults are used.
- Directory-level concurrency is controlled by `max_workers` on the client methods and CLI flags.
- Progress is printed to stderr at ~5 Hz in the form `Uploading: X/Y bytes (Z%)` or `Downloading: ...`, and concludes with a newline.
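A sketch of how per-transfer concurrency is typically wired with `boto3` (values are illustrative; the client's actual settings may differ):

```python
from boto3.s3.transfer import TransferConfig

# Illustrative values; the client's real configuration may differ.
config = TransferConfig(
    max_concurrency=8,                    # threads used within one transfer
    multipart_threshold=8 * 1024 * 1024,  # switch to multipart above 8 MiB
)
# boto3's S3 client accepts this per call, e.g.:
#   s3.upload_file(local_path, bucket, key, Config=config)
```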
### Safety Semantics

- Fail-fast no-overwrite: both `upload_directory` and `download_directory` perform preflight checks when `overwrite=False` and abort before starting any transfers if conflicts are found.
- Prefix/object detection: the download CLI treats a trailing `/` as a prefix; otherwise it tries the exact key first and falls back to a prefix if appropriate.
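The preflight pattern, as a minimal sketch (an assumed shape: one listing of the destination prefix, intersected with the keys about to be written; keys are hypothetical):

```python
from s6.utils.r2 import R2Client

client = R2Client(bucket_name="assets",
                  endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com")

# Assumed shape of the fail-fast preflight, not the exact implementation:
# one listing up front, abort before any transfer starts if there is a conflict.
objects, _ = client.list("datasets/my_set/")
existing = {obj.key for obj in objects}
planned = {"datasets/my_set/data.jsonl"}  # hypothetical keys about to be written
conflicts = planned & existing
if conflicts:
    raise FileExistsError(f"refusing to overwrite: {sorted(conflicts)}")
```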
## Examples

### Multi-User Workflow: Record → Upload → Replay
This workflow shows how one teammate records a dataset and another replays it remotely using the shared R2 storage.
On Machine A (Recorder):
```bash
# 1) Record from live input into a local dataset directory
s6 track -i network -o ./temp/run_net_01 -r -x

# 2) Upload the dataset to shared storage
s6 dataset upload run_net_01
# or equivalently, using r2
s6 r2 upload ./temp/run_net_01 datasets/run_net_01/ -w 12 -p
```
On Machine B (Consumer):
```bash
# 3) Download the dataset locally
s6 dataset download run_net_01
# or with r2
s6 r2 download datasets/run_net_01/ -o ./temp -w 12 -p

# 4) Replay the dataset for development and testing
s6 track -i ./temp/run_net_01 --repeat -x
```
Tips:
- Use `--config <file.yaml>` with `s6 track` to test different pipeline configurations against the same dataset.
- Keep `--repeat` on during development for quick iterative cycles.
- Profiling with `-x` writes Chrome trace logs; see `docs/recipes/pipeline_chrome_trace.md`.
### End-to-end roundtrip for a dataset named `diverse_2`
```bash
# Upload with defaults (parallel + progress)
s6 dataset upload diverse_2

# On another machine, download it
s6 dataset download diverse_2

# Inspect locally and remotely
s6 dataset list
s6 dataset list --remote

# Clean up remotely
s6 dataset delete diverse_2 --yes
```
Direct R2 usage:
```bash
s6 r2 list datasets/ --flat
s6 r2 upload ./temp/diverse_2 datasets/diverse_2/ -w 12 -p
s6 r2 download datasets/diverse_2/ -o temp -w 12 -p
s6 r2 delete datasets/diverse_2/ -r
```
## Troubleshooting

- Missing credentials: set `R2_ACCESS_KEY_ID`/`R2_SECRET_ACCESS_KEY` (or their `AWS_*` equivalents) and ensure your endpoint URL is correct for your R2 account.
- Partial results: operations should be all-or-nothing when `overwrite=False`. If you need to replace files, pass `--overwrite` (supported by both the download and upload commands). For single-file uploads, delete the existing key first.
- Performance: increase `-w`/`--workers` for more parallelism. Very large datasets benefit from higher worker counts; monitor your network and any rate limits.