# Dataset Storage and R2 Utilities

This document describes the dataset management workflow and the S3-compatible R2 utilities added to the S6 CLI. It covers the `s6 dataset` master command, the `s6 r2` helper commands, and the underlying `R2Client` with API and behavior details.

## Concepts

- Dataset directory: a folder that contains a `data.jsonl` file. Examples live under `./temp`.
- Local root: by default all dataset operations target `./temp`.
- Remote layout: datasets are stored under the R2 prefix `datasets/<name>/` by default (configurable). Keys preserve directory structure on upload/download.

## Motivation

Research and development loops produce sizable datasets (raw frames, previews, JSONL annotations, logs) that must be shared across local developer laptops, remote teammates, and lab machines. We wanted a single, predictable interface to move datasets without coupling the rest of the codebase to any vendor SDK or filesystem mount.

Cloudflare R2 is S3‑compatible and economical (no egress fees), so we target a generic S3 API and keep the storage endpoint configurable. That gives us:

- A uniform path model (keys, prefixes) that maps cleanly to dataset folders.
- Credentials via environment variables to avoid hard‑coding secrets.
- Swappability (R2 today, MinIO/AWS tomorrow) by only changing the endpoint.
- Clear safety semantics (no accidental overwrites) for collaborative work.

In short, the goal is to make “record → upload → iterate → download” as simple and reliable as copying a folder locally, but robust across machines.

## Credentials and Defaults

R2 is S3-compatible and uses standard S3 credentials. The client resolves credentials/region from environment variables when not explicitly provided.

- Access key: `R2_ACCESS_KEY_ID` or `AWS_ACCESS_KEY_ID`
- Secret key: `R2_SECRET_ACCESS_KEY` or `AWS_SECRET_ACCESS_KEY`
- Region: `R2_REGION_NAME` or `AWS_REGION` or `AWS_DEFAULT_REGION` (optional)
- Bucket default: `assets` (override with `-b/--bucket` or `R2_BUCKET`)
- Endpoint default: `https://1195172285921be7f47e85de5cc4a5ad.r2.cloudflarestorage.com` (override with `-e/--endpoint` or `R2_ENDPOINT`/`R2_ENDPOINT_URL`)

All CLI commands accept flags to override bucket/endpoint/region and also honor the environment variables listed above.

## Setup

Follow these steps to obtain and configure R2 credentials with the standard environment variables we use (`R2_ACCESS_KEY_ID` and `R2_SECRET_ACCESS_KEY`).

1) Create an Access Key in Cloudflare R2

- Sign in to the Cloudflare dashboard.
- Navigate to R2 → S3 API → Create Access Key (or Manage R2 → Create access key).
- Choose permissions (typically Object Read/Write) and optionally restrict to specific buckets.
- Copy the generated Access Key ID and Secret Access Key. You will not be able to view the secret again after closing the dialog.

2) Find your Account ID and Endpoint URL

- In the R2 S3 API section, note your Account ID. The endpoint URL is: `https://<ACCOUNT_ID>.r2.cloudflarestorage.com`
- Example used in this repo’s defaults: `https://1195172285921be7f47e85de5cc4a5ad.r2.cloudflarestorage.com`

3) Export credentials and endpoint in your shell

```bash
# Required credentials
export R2_ACCESS_KEY_ID="<your-access-key-id>"
export R2_SECRET_ACCESS_KEY="<your-secret-access-key>"

# Optional: set endpoint and bucket (overrides CLI defaults)
export R2_ENDPOINT="https://<ACCOUNT_ID>.r2.cloudflarestorage.com"
export R2_BUCKET="assets"

# Optional: set region if needed; R2 typically works with "auto"
export R2_REGION_NAME="auto"
```

To make these persistent, add the exports to your shell profile (e.g., `~/.zshrc` or `~/.bashrc`) and reload your shell.
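Before validating via the CLI, note that the same variables are honored when you use the library directly. A rough sketch, assuming `R2_ENDPOINT` is exported as above and using the `R2Client` API documented later in this file:

```python
import os

from s6.utils.r2 import R2Client

# Credentials and region are resolved from R2_*/AWS_* environment variables;
# bucket and endpoint are passed explicitly here for clarity.
client = R2Client(
    bucket_name=os.environ.get("R2_BUCKET", "assets"),
    endpoint_url=os.environ["R2_ENDPOINT"],
)

# One-level listing under the dataset prefix: returns (objects, child prefixes).
objects, prefixes = client.list("datasets/", recursive=False)
print(prefixes)
```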
4) Validate the setup

```bash
# List at root (uses defaults if set); add -b/-e to override
s6 r2 list -b "${R2_BUCKET:-assets}" -e "${R2_ENDPOINT}"

# Or list datasets one level deep
s6 r2 list datasets/ --flat -b "${R2_BUCKET:-assets}" -e "${R2_ENDPOINT}"
```

If these commands return object keys or prefixes (instead of auth errors), your credentials and endpoint are configured correctly.

## CLI: `s6 r2` Utilities

Location: `src/s6/app/r2`

Shared flags (on all subcommands):

- `-b, --bucket`: bucket name (default `assets`)
- `-e, --endpoint`: endpoint URL (defaults to the endpoint above)
- `--region`: explicit region override (optional)

### List

Command: `s6 r2 list [prefix] [--flat]`

- Lists objects under an optional prefix. By default, lists recursively.
- `--flat` performs a one-level list and also prints child prefixes.

Examples:

- `s6 r2 list`
- `s6 r2 list datasets/ --flat`
- `s6 r2 list datasets/my_set/`

### Download

Command: `s6 r2 download <key-or-prefix> [-o DIR] [-w N] [-p] [--overwrite]`

- If the argument ends with `/`, it is treated as a prefix download; otherwise the command auto-detects between an exact object and a prefix fallback.
- `-o, --output` sets the local destination directory (default `temp`).
- `-w, --workers` controls parallelism for prefix downloads (default `8`).
- `-p, --progress` shows a running progress counter.
- `--overwrite` allows replacing existing local files. Without it, the command aborts early if any target path already exists (fail-fast safety).

Examples:

- `s6 r2 download datasets/my_set/ -o temp -w 12 -p`
- `s6 r2 download datasets/my_set/data.jsonl -o temp -p`

### Upload

Command: `s6 r2 upload <source> <destination> [-w N] [-p] [--overwrite]`

- If `<source>` is a directory, uploads recursively under the destination prefix; if the destination does not end with `/`, one is appended.
- If `<source>` is a file and the destination ends with `/`, the file's basename is appended; otherwise it uploads to the provided key.
- `-w, --workers` controls parallelism for directory uploads (default `8`).
- `-p, --progress` enables progress output.
- Without `--overwrite`, upload is fail-fast if any destination keys already exist (preflight check via prefix listing).

Examples:

- `s6 r2 upload ./temp/diverse_2 datasets/diverse_2/ -w 12 -p`
- `s6 r2 upload ./temp/diverse_2/data.jsonl datasets/diverse_2/data.jsonl -p`

### Delete

Command: `s6 r2 delete <key-or-prefix> [-r|--recursive]`

- Deletes exactly one object unless `--recursive` is provided or the argument ends with `/`, in which case it deletes all objects under the prefix.

Examples:

- `s6 r2 delete datasets/diverse_2/data.jsonl`
- `s6 r2 delete datasets/diverse_2/ -r`

## CLI: `s6 dataset` Master Command

Location: `src/s6/app/dataset.py`

Global flags (apply to all subcommands):

- `--root`: local root for dataset directories (default `temp`)
- `--remote-prefix`: remote base prefix (default `datasets/`)
- `-w, --workers`: parallel workers for upload/download (default `8`)
- R2 flags: `-b/--bucket`, `-e/--endpoint`, `--region` (same defaults as above)

Subcommands:

### List

`s6 dataset list [--remote] [--local] [--local-only]`

- Lists dataset names (folders with `data.jsonl`).
- Local: scans `--root` (default `temp`).
- Remote: lists one-level prefixes under `--remote-prefix` (default `datasets/`).
- If no `--local/--remote` flags are passed, local is shown by default.

### Upload

`s6 dataset upload <name> [--overwrite]`

- Uploads the local dataset `--root/<name>` to the remote prefix `--remote-prefix/<name>/`.
- Uses parallel uploads with progress enabled by default.
- Without `--overwrite`, performs a fail-fast preflight: if any destination key already exists, the command aborts before uploading.
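The same operation can also be driven from Python via `R2Client.upload_directory` (documented in the R2Client API section below). A minimal sketch, assuming the default root `temp`, the default remote prefix `datasets/`, and an endpoint taken from `R2_ENDPOINT`:

```python
import os

from s6.utils.r2 import R2Client

name = "diverse_2"  # a dataset directory under ./temp that contains data.jsonl

client = R2Client(
    bucket_name=os.environ.get("R2_BUCKET", "assets"),
    endpoint_url=os.environ["R2_ENDPOINT"],
)

# Mirrors `s6 dataset upload diverse_2`: temp/diverse_2 -> datasets/diverse_2/
client.upload_directory(
    f"temp/{name}",
    f"datasets/{name}/",
    max_workers=8,    # parallel uploads, same default as the CLI
    progress=True,    # aggregate byte progress on stderr
    overwrite=False,  # fail fast if any destination key already exists
)
```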
### Download

`s6 dataset download <name> [--overwrite]`

- Downloads the remote dataset `--remote-prefix/<name>/` into the local directory `--root/<name>`.
- Uses parallel downloads with progress enabled by default.
- Without `--overwrite`, aborts early if any local target files already exist.
- Warns if the downloaded directory does not contain `data.jsonl`.

### Delete

`s6 dataset delete <name> [-y|--yes] [--local]`

- Deletes the remote dataset at `--remote-prefix/<name>/` (requires `--yes`).
- With `--local`, also removes the local dataset directory.

## R2Client API

Location: `src/s6/utils/r2.py`

### Design Overview

The storage client and the CLI are layered to keep responsibilities clear:

- R2Client (library)
  - Thin façade over a `boto3` S3 client bound to a specific bucket/endpoint.
  - Resolves credentials/region from arguments or the environment (`R2_*` or `AWS_*`).
  - Implements the minimal primitives we rely on: list, upload, download, delete.
  - Listings paginate and optionally use a delimiter for one‑level views.
  - Upload helpers refuse to overwrite by default; directory uploads allow an explicit `overwrite=True`.
  - Existence is checked via `HEAD` to avoid listing entire buckets.
  - Imports are guarded so documentation builds don’t require `boto3`.
- CLI utilities (`s6 r2` / `s6 dataset`)
  - Compose R2Client primitives into higher‑level operations with progress and concurrency controls.
  - Map local directories to object keys by preserving relative paths under a chosen prefix (`datasets/<name>/`).
  - Perform preflight safety checks to achieve fail‑fast, all‑or‑nothing behavior in collaborative environments.

See the module docs in `s6.utils.r2` for API‑level details and examples.

### Construction

```python
R2Client(
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    bucket_name: str = "",
    endpoint_url: str = "",
    region_name: str = "auto",
    session: Optional[boto3.session.Session] = None,
)
```

- Credentials/region are read from the environment when arguments are `None`.
- `region_name="auto"` uses `R2_REGION_NAME`/`AWS_REGION`/`AWS_DEFAULT_REGION` when present.

### Listing

```python
list(prefix: str = "", recursive: bool = True) -> tuple[list[R2Object], list[str]]
```

- Returns `(objects, prefixes)`; `prefixes` is populated only when `recursive=False` (one-level list).

`R2Object` fields:

- `key: str`
- `size: int`
- `last_modified: Any`
- `etag: str`

### Download

```python
download_file(key: str, local_path: str, *, progress: bool = False) -> None

download_directory(
    prefix: str,
    local_dir: str,
    *,
    max_workers: int = 8,
    progress: bool = False,
    overwrite: bool = False,
) -> None
```

- Directory downloads are parallelized with a thread pool and show aggregate progress when `progress=True`.
- When `overwrite=False`, a preflight check ensures none of the target files already exist; otherwise it raises `FileExistsError` (fail-fast).

### Upload

```python
upload_file(local_path: str, key: str, *, progress: bool = False) -> None

upload_directory(
    local_dir: str,
    prefix: str,
    *,
    overwrite: bool = False,
    max_workers: int = 8,
    progress: bool = False,
) -> None

upload_bytes(data: bytes, key: str) -> None
```

- Directory uploads are parallelized; with `progress=True`, bytes across all files are aggregated and reported.
- When `overwrite=False`, a preflight uses a single remote listing to detect any destination collisions and aborts early with `FileExistsError` (see the sketch below).
- `upload_file` and `upload_bytes` never overwrite; delete first to replace.
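To make the fail-fast behavior concrete, here is a rough sketch of the preflight idea expressed with the `list` primitive above. It is illustrative only; the helper name and error message are made up and this is not the actual implementation:

```python
from s6.utils.r2 import R2Client


def preflight_upload(client: R2Client, relative_paths: list[str], prefix: str) -> None:
    """Illustrative preflight: abort before any byte is transferred if a destination key exists."""
    existing, _ = client.list(prefix)                        # one listing covers the whole prefix
    existing_keys = {obj.key for obj in existing}
    planned_keys = {prefix + rel for rel in relative_paths}  # keys preserve relative paths

    clashes = sorted(planned_keys & existing_keys)
    if clashes:
        raise FileExistsError(
            f"{len(clashes)} destination key(s) already exist (e.g. {clashes[0]}); "
            "use overwrite=True / --overwrite to replace them"
        )
```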
### Delete

```python
delete_object(key: str, missing_ok: bool = True) -> None
```

- Deletes an individual object. With `missing_ok=False`, raises if the object is not found.

### Concurrency and Progress

- Per-transfer concurrency is configured via `boto3.s3.transfer.TransferConfig` when available; otherwise defaults are used.
- Directory-level concurrency is controlled by `max_workers` on the client methods and CLI flags.
- Progress is printed to stderr at ~5 Hz in the form `Uploading: X/Y bytes (Z%)` or `Downloading: ...` and concludes with a newline.

## Safety Semantics

- Fail-fast no-overwrite: both `upload_directory` and `download_directory` perform preflight checks when `overwrite=False` and abort before starting any transfers if conflicts are found.
- Prefix/object detection: the download CLI treats a trailing `/` as a prefix; otherwise it tries the exact key first and falls back to a prefix if appropriate.

## Examples

### Multi‑User Workflow: Record → Upload → Replay

This workflow shows how one teammate records a dataset and another replays it remotely using the shared R2 storage.

On Machine A (Recorder):

```bash
# 1) Record from live input into a local dataset directory
s6 track -i network -o ./temp/run_net_01 -r -x

# 2) Upload the dataset to shared storage
s6 dataset upload run_net_01

# or equivalently, using r2
s6 r2 upload ./temp/run_net_01 datasets/run_net_01/ -w 12 -p
```

On Machine B (Consumer):

```bash
# 3) Download the dataset locally
s6 dataset download run_net_01

# or with r2
s6 r2 download datasets/run_net_01/ -o ./temp -w 12 -p

# 4) Replay the dataset for development and testing
s6 track -i ./temp/run_net_01 --repeat -x
```

Tips:

- Use `--config <path>` with `s6 track` to test different pipeline configurations against the same dataset.
- Keep `--repeat` on during development for quick iterative cycles.
- Profiling with `-x` writes Chrome trace logs; see `docs/recipes/pipeline_chrome_trace.md`.

### End-to-end roundtrip for a dataset named `diverse_2`

```bash
# Upload with defaults (parallel + progress)
s6 dataset upload diverse_2

# On another machine, download it
s6 dataset download diverse_2

# Inspect locally and remotely
s6 dataset list
s6 dataset list --remote

# Clean up remotely
s6 dataset delete diverse_2 --yes
```

Direct R2 usage:

```bash
s6 r2 list datasets/ --flat
s6 r2 upload ./temp/diverse_2 datasets/diverse_2/ -w 12 -p
s6 r2 download datasets/diverse_2/ -o temp -w 12 -p
s6 r2 delete datasets/diverse_2/ -r
```

## Troubleshooting

- Missing credentials: set `R2_ACCESS_KEY_ID`/`R2_SECRET_ACCESS_KEY` (or the `AWS_*` equivalents) and ensure your endpoint URL is correct for your R2 account.
- Partial results: operations should be all-or-nothing when `overwrite=False`. If you need to replace files, pass `--overwrite` to the download or upload commands. For single-file uploads, delete the existing object first and re-upload it (see the sketch below).
- Performance: increase `-w/--workers` for more parallelism. Very large datasets benefit from higher worker counts; monitor your network and rate limits.
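As a companion to the "delete first" note above, a minimal Python sketch of replacing a single object with the client primitives documented earlier (the endpoint is assumed to come from `R2_ENDPOINT`):

```python
import os

from s6.utils.r2 import R2Client

client = R2Client(
    bucket_name=os.environ.get("R2_BUCKET", "assets"),
    endpoint_url=os.environ["R2_ENDPOINT"],
)

key = "datasets/diverse_2/data.jsonl"

# upload_file never overwrites, so replacing a single object is delete-then-upload.
client.delete_object(key, missing_ok=True)  # no error if the key is already absent
client.upload_file("temp/diverse_2/data.jsonl", key, progress=True)
```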